Abstract

This paper considers the modeling and the analysis of the performance of lock-free concurrent 
data structures. Lock-free designs employ an optimistic conflict control mechanism, allowing several 
processes to access the shared data object at the same time. They guarantee that at least one 
concurrent operation finishes in a finite number of its own steps regardless of the state of the 
operations. Our analysis considers such lock-free data structures that can be represented as linear 
combinations of fixed size retry loops. 

Our main contribution is a new way of modeling and analyzing a general class of lock-free 
algorithms, achieving predictions of throughput that are close to what we observe in practice. 
We emphasize two kinds of conflicts that shape the performance: (i) hardware conflicts, due to 
concurrent calls to atomic primitives; (ii) logical conflicts, caused by simultaneous operations on 
the shared data structure. 

We show how to deal with these hardware and logical conflicts separately, and how to combine 
them, so as to calculate the throughput of lock-free algorithms. We also propose a common
framework that enables a fair comparison between lock-free implementations by covering the whole 
contention domain, together with a better understanding of the performance impacting factors. 
This part of our analysis comes with a method for calculating a good back-off strategy to finely 
tune the performance of a lock-free algorithm. Our experimental results, based on a set of widely 
used concurrent data structures and on abstract lock-free designs, show that our analysis follows 
closely the actual code behavior. 
I. Introduction

Lock-free programming provides highly concurrent access to data and has been increasing its 
footprint in industrial settings. Providing a modeling and an analysis framework capable of describing 
the practical performance of lock-free algorithms is an essential, missing resource necessary to the 
parallel programming and algorithmic research communities in their effort to build on previous 
intellectual efforts. The definition of lock-freedom mainly guarantees that at least one concurrent 
operation on the data structure finishes in a finite number of its own steps, regardless of the state of 
the operations. On the individual operation level, lock-freedom cannot guarantee that an operation 
will not starve. 

The goal of this paper is to provide a way to model and analyze the practically observed 
performance of lock-free data structures. In the literature, the common performance measure of 
a lock-free data structure is the throughput, i.e. the number of successful operations per unit of 
time. It is obtained while threads are accessing the data structure according to an access pattern 
that interleaves local work between calls to consecutive operations on the data structure. Although 
this access pattern to the data structure is significant, there is no consensus in the literature on
which access pattern to use when comparing two data structures. The amount of local work (which we
refer to as parallel work in the rest of the paper) can be constant ([MS96], [SL00]), uniformly
distributed ([HSY10], [DLM13]), exponentially distributed ([Val94], [DB08]), null ([KH14], [LJ13]),
etc. More questionably, the average amount is rarely swept over a range of values, which leads to a
partial coverage of the contention domain.

We propose here a common framework enabling a fair comparison between lock-free data 
structures, while exhibiting the main phenomena that drive performance, and particularly the 
contention, which leads to different kinds of conflicts. As this is the first step in this direction, 
we want to deeply analyze the core of the problem, without impacting factors being diluted within 
a probabilistic smoothing. Therefore, we choose a constant local work, hence constant access rate 
to the data structures. In addition to the prediction of the data structure performance, our model 
provides a good back-off strategy, that achieves the peak performance of a lock-free algorithm. 

Two kinds of conflict appear during the execution of a lock-free algorithm, both of them leading 
to additional work. Hardware conflicts occur when concurrent operations call atomic primitives on
the same data: these calls collide and lead to stall time, which we call expansion. Logical
conflicts take place if concurrent operations overlap: because of the lock-free nature of the algorithm, 
several concurrent operations can run simultaneously, but only one retry can logically succeed. We 
show that the additional work produced by the failures is not necessarily harmful for the system-wise 
performance. 

We then show how throughput can be computed by connecting these two key factors in an iterative 
way. We start by estimating the expansion probabilistically, and emulate the effect of stall time 
introduced by the hardware conflicts as extra work added to each thread. Then we estimate the 
number of failed operations, that in turn lead to additional extra work, by computing again the 
expansion on a system setting where those two new amounts of work have been incorporated, and 
reiterate the process; the convergence is ensured by a fixed-point search. 

We consider the class of lock-free algorithms that can be modeled as a linear composition of fixed 
size retry loops. This class covers numerous extensively used lock-free designs such as stacks [Tre86]
(Pop, Push), queues [MS96] (Enqueue, Dequeue), counters [DLM13] (Increment, Decrement) and
priority queues [LJ13] (DeleteMin).

To evaluate the accuracy of our model and analysis framework, we performed experiments both 
on synthetic tests, that capture a wide range of possible abstract algorithmic designs, and on several 
reference implementations of extensively studied lock-free data structures. Our evaluation results 
reveal that our model is able to capture the behavior of all the synthetic and real designs for 
all different numbers of threads and sizes of parallel work (consequently also contention). We also
evaluate the use of our analysis as a tool for tuning the performance of lock-free code by selecting
the appropriate back-off strategy that will maximize throughput, comparing our method
against widely known back-off policies, namely linear and exponential.

The rest of the paper is organized as follows. We discuss related work in Section II, then the
problem is formally described in Section III. We consider the logical conflicts in the absence of
hardware conflicts in Section IV, while in Section V we first show how to compute the expansion,
then combine hardware and logical conflicts to obtain the final throughput estimate. We describe
the experimental results in Section VI.


II. Related Work 

Anderson et al. [ARJ97] evaluated the performance of lock-free objects in a single processor
real-time system by emphasizing the impact of retry loop interference. Tasks can be preempted during
the retry loop execution, which can lead to interference, and consequently to an inflation in retry 
loop execution due to retries. They obtained upper bounds for the number of interferences under 
various scheduling schemes for periodic real-time tasks. 

Intel [Int13] conducted an empirical study to illustrate performance and scalability of locks. They
showed that the critical section size, the time interval between releasing and re-acquiring the lock 
(that is similar to our parallel section size) and number of threads contending the lock are vital 
parameters. 

Failed retries do not only lead to useless effort but also degrade the performance of successful 
ones by contending the shared resources. Alemany et al. [AF92] have pointed out this fact, that is in
accordance with our two key factors, and, without trying to model it, have mitigated those effects 
by designing non-blocking algorithms with operating system support. 

Alistarh et al. [ACS14] have studied the same class of lock-free structures that we consider in this
paper. The analysis is done in terms of scheduler steps, in a system where only one thread can be 
scheduled (and can then run) at each step. If compared with execution time, this is particularly 
appropriate to a system with a single processor and several threads, or to a system where the 
instructions of the threads cannot be done in parallel (e.g. multi-threaded program on a multi-core
processor with only read and write on the same cache line of the shared memory). In our paper,
the execution is evaluated in terms of processor cycles, strongly related to the execution time. In
addition, the “parallel work” and the “critical work” can be done in parallel, and we only consider
retry-loops with one Read and one CAS, which are serialized. Moreover, they bound the asymptotic
expected system latency (with a big O, when the number of threads tends to infinity), while in our
paper we estimate the throughput (close to the inverse of system latency) for any number of threads. 


III. Problem Statement 
A. Running Program and Targeted Platform 

In this paper, we aim at evaluating the throughput of a multi-threaded algorithm that is based 
on the utilization of a shared lock-free data structure. Such a program can be abstracted by the 
Procedure AbstractAlgorithm (see Figure 1) that represents the skeleton of the function which is
called by each spawned thread. It is decomposed in two main phases: the parallel section, represented
on line 3, and the retry loop, from line 4 to line 7. A retry starts at line 5 and ends at line 7.

As for line 1, the function Initialization shall be seen as an abstraction of the delay between the
spawns of the threads, that is expected not to be null, even when a barrier is used. We then consider 
that the threads begin at the exact same time, but have different initialization times. 

The parallel section is the part of the code where the thread does not access the shared data
structure; the work that is performed inside this parallel section can possibly depend on the value
that has been read from the data structure, e.g. in the case of processing an element that has been
dequeued from a FIFO (First-In-First-Out) queue.

Procedure AbstractAlgorithm

1  Initialization();
2  while !done do
3      Parallel_Work();
4      while !success do
5          current ← Read(AP);
6          new ← Critical_Work(current);
7          success ← CAS(AP, current, new);

Figure 1: Thread procedure

Figure 2: Execution with one wasted retry, and one inevitable failure

In each retry, a thread tries to modify the data structure, and does not exit the retry loop until
it has successfully modified the data structure. It does that by first reading the access point AP of
the data structure, then according to the value that has been read, and possibly to other previous
computations that occurred in the past, the thread prepares the new desired value as an access
point of the data structure. Finally, it atomically tries to perform the change through a call to the
Compare-And-Swap (CAS) primitive. If it succeeds, i.e. if the access point has not been changed
by another thread between the first Read and the CAS, then it goes to the next parallel section,
otherwise it repeats the process. The retry loop is composed of at least one retry, and we number
the retries starting from 0, since the first iteration of the retry loop is actually not a retry, but a
try.
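The retry loop described above can be sketched in Python as follows. This is only an illustration under stated assumptions: Python has no hardware CAS, so the hypothetical `AtomicRef` class emulates one with a lock, and the critical work is taken to be a simple increment. The names `AtomicRef`, `worker` and `run` are not from the paper.

```python
import threading

class AtomicRef:
    """Emulated atomic cell; the lock stands in for a hardware CAS (assumption)."""
    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def read(self):
        return self._value

    def cas(self, expected, new):
        # Atomically replace the value only if it still equals `expected`.
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False

def worker(ap, operations, failures):
    for _ in range(operations):
        # Parallel_Work() would run here, without touching the shared cell.
        success = False
        while not success:
            current = ap.read()             # Read(AP)
            new = current + 1               # Critical_Work(current), here an increment
            success = ap.cas(current, new)  # CAS(AP, current, new)
            if not success:
                failures.append(1)          # one failed retry (logical conflict)

def run(threads=4, operations=1000):
    ap = AtomicRef(0)
    failures = []
    pool = [threading.Thread(target=worker, args=(ap, operations, failures))
            for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return ap.read(), len(failures)
```

Every operation eventually succeeds exactly once, so the final value equals the total number of operations, whatever the number of failed retries.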

We analyze the behavior of AbstractAlgorithm from a throughput perspective, which is defined
as the number of successful data structure operations per unit of time. In the context of Procedure
AbstractAlgorithm, it is equivalent to the number of successful CASs.

Figure 3: Execution with minimum number of failures


The throughput of the lock-free algorithm, that we denote by T, is impacted by several parameters. 

• Algorithm parameters: the amount of work inside a call to Parallel_Work (resp. Critical_Work) 
denoted by pw (resp. cw). 

• Platform parameters: Read and CAS latencies (rc and cc respectively), and the number P of
processing units (cores). We assume homogeneity for the latencies, i.e. every thread experiences 
the same latency when accessing an uncontended shared data, which is achieved in practice by 
pinning threads to the same socket. 


B. Examples and Issues 

We first present two straightforward upper bounds on the throughput, and describe the two kinds 
of conflict that keep the actual throughput away from those upper bounds. 

1) Immediate Upper Bounds: Trivially, the minimum amount of work $rlw^{(-)}$ in a given retry is
$rlw^{(-)} = rc + cw + cc$, as we should pay at least the memory accesses and the critical work $cw$ in
between.

Thread-wise: A given thread can at most perform one successful retry every $pw + rlw^{(-)}$ units
of time. In the best case, $P$ threads can then lead to a throughput of $P/(pw + rlw^{(-)})$.

System-wise: By definition, two successful retries cannot overlap, hence we have at most 1
successful retry every $rlw^{(-)}$ units of time.

Altogether, the throughput $T$ is bounded by

$$T \leq \min\left(\frac{1}{rc + cw + cc},\ \frac{P}{pw + rc + cw + cc}\right),$$

i.e.

$$T \leq \begin{cases} \dfrac{1}{rc + cw + cc} & \text{if } pw \leq (P-1)(rc + cw + cc) \\[2ex] \dfrac{P}{pw + rc + cw + cc} & \text{otherwise.} \end{cases} \tag{1}$$
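As an illustration of the two immediate upper bounds, the following sketch evaluates the thread-wise and system-wise limits and returns the smaller one. The function name `throughput_upper_bound` and the sample values are hypothetical; the parameters follow the notation of this section.

```python
def throughput_upper_bound(P, pw, rc, cw, cc):
    """Upper bound on throughput: min of system-wise and thread-wise limits."""
    rlw_min = rc + cw + cc              # minimum work inside one retry
    system_wise = 1.0 / rlw_min         # successful retries cannot overlap
    thread_wise = P / (pw + rlw_min)    # each thread succeeds at most once per pw + rlw_min
    return min(system_wise, thread_wise)
```

With `P = 4` and a retry of 10 time units, the two limits coincide at `pw = 30`, i.e. at `pw = (P - 1) * (rc + cw + cc)`; below that value the system-wise limit dominates, above it the thread-wise one does.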


2) Conflicts: 

Logical conflicts: Equation (1) expresses the fact that when $pw$ is small enough, i.e. when
$pw \leq (P-1)\, rlw^{(-)}$, we cannot expect that every thread performs a successful retry every
$pw + rlw^{(-)}$ units of time, since it is more than what the retry loop can afford. As a result, some
logical conflicts, hence unsuccessful retries, will be inevitable, while the others, if any, are called wasted.

However, different executions can lead to different numbers of failures, which end up with different
throughput values. Figures 2 and 3 depict two executions, where the black parts are the calls to
Initialization, the blue parts are the parallel sections, and the retries can be either unsuccessful (in
red) or successful (in green). We experiment with different initialization times, and observe different
synchronizations, hence different numbers of wasted retries. After the initial transient state, the
execution depicted in Figure 3 comprises only the inevitable unsuccessful retries, while the execution
of Figure 2 contains one wasted retry.

We can see on those two examples that a cyclic execution is reached after the transient behavior;
actually, we show in Section IV that, in the absence of hardware conflicts, every execution will become
periodic, if the initialization times are spaced enough. In addition, we prove that the shortest period
is such that, during this period, every thread succeeds exactly once. This finally leads us to define
the additional failures as wasted, since we can directly link the throughput with this number of
wasted retries: a higher number of wasted retries implying a lower throughput.
Figure 4: Expansion (legend: CAS; Read & cw; Expansion; Previously expanded CAS)


Hardware conflicts: The requirement of atomicity compels the ownership of the data in an
exclusive manner by the executing core. This fact prohibits concurrent execution of atomic
instructions if they are operating on the same data. Therefore, overlapping parts of atomic instructions
are serialized by the hardware, leading to stalls in subsequently issued ones. For our target lock-free
algorithm, these stalls, which we refer to as expansion, become an important slowdown factor in case
threads interfere in the retry loop. As illustrated in Figure 4, the latency for CAS can expand and
cause remarkable decreases in throughput since the CAS of a successful thread is then expanded by
others; for this reason, the amount of work inside a retry is not constant, but is, generally speaking,
a function depending on the number of threads that are inside the retry loop.

3) Process: We deal with the two kinds of conflicts separately and connect them together through 
the fixed-point iterative convergence. 

In Section V-A, we compute the expansion in execution time of a retry, noted e, by following
a probabilistic approach. The estimation takes as input the expected number of threads inside the 
retry loop at any time, and returns the expected increase in the execution time of a retry due to the 
serialization of atomic primitives. 

In Section IV, we are given a program without hardware conflict, described by the size of the parallel
section $pw^{(+)}$ and the size of a retry $rlw^{(+)}$. We compute upper and lower bounds on the throughput
$T$, the number of wasted retries m, and the average number of threads inside the retry loop $P_{rl}$.
Without loss of generality, we can normalize those execution times by the execution time of a retry,
and define the parallel section size as $pw^{(+)} = q + r$, where $q$ is a non-negative integer and $r$ is such
that $0 \leq r < 1$. This pair (together with the number of threads $P$) constitutes the actual input of
the estimation.
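The normalization step can be written down directly. The sketch below, with the hypothetical helper name `normalize`, divides the parallel work by the retry size and splits the result into its integer part `q` and fractional part `r`.

```python
import math

def normalize(pw, rlw):
    """Express the parallel work in units of one retry: pw / rlw = q + r."""
    ratio = pw / rlw
    q = math.floor(ratio)   # non-negative integer part
    r = ratio - q           # fractional part, 0 <= r < 1
    return q, r
```

For instance, a parallel section of 25 time units with retries of 10 units gives `q = 2` and `r = 0.5`.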

Finally, we combine those two outcomes in Section V-B by emulating expansion through work not 
prone to hardware conflicts and obtain the full estimation of the throughput. 


IV. Execution without hardware conflict 

We show in this section that, in the absence of hardware conflicts, the execution becomes periodic,
which eases the calculation of the throughput. We start by defining some useful concepts: $(f, P)$-cyclic
executions are a special kind of periodic executions such that within the shortest period, each
thread performs exactly $f$ unsuccessful retries and 1 successful retry. The well-formed seed is a set
of events that allows us to detect an $(f, P)$-cyclic execution early, and the gaps are a measure of the
quality of the synchronization between threads. The idea is to iteratively add threads into the game
and show that the periodicity is maintained. Theorem 1 establishes a fundamental relation between
gaps and well-formed seeds, while Theorem 2 proves the periodicity, relying on the disjoint cases
of Lemmas 1, 2 and 3. Finally, we exhibit upper and lower bounds on throughput and number of
failures, along with the average number of threads inside the retry loop.
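To make the conflict-free setting concrete, here is a minimal event-driven sketch of it, under assumptions that are ours rather than the paper's: a retry lasts exactly one time unit, a retry succeeds if and only if no successful retry has completed since it started, a failed thread retries immediately, and ties are broken by thread number. The function name `simulate` and its parameters are hypothetical.

```python
import heapq

def simulate(P, pw, init_times, horizon):
    """Return (end_time, thread_id) for every successful retry ending by `horizon`."""
    # Heap entries are (end_of_retry, thread_id, start_of_retry); a retry lasts 1 unit.
    heap = [(init_times[i] + 1.0, i, init_times[i]) for i in range(P)]
    heapq.heapify(heap)
    last_success = float("-inf")
    successes = []
    while heap:
        end, tid, start = heapq.heappop(heap)
        if end > horizon:
            break
        if last_success <= start:
            # No successful CAS since this retry's Read: the CAS succeeds.
            last_success = end
            successes.append((end, tid))
            nxt = end + pw      # run the parallel section, then start a new retry
        else:
            nxt = end           # failed retry: start the next retry immediately
        heapq.heappush(heap, (nxt + 1.0, tid, nxt))
    return successes
```

Counting the returned successes and dividing by `horizon` gives an empirical throughput that can be compared against the bounds of Section III-B, with the retry size normalized to 1.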


A. Setting 

1) Initial Restrictions: 
Remark 1. Concerning correctness, we assume that the reference point of the Read and the CAS
occurs when the thread enters and exits any retry, respectively.


Remark 2. We do not consider simultaneous events, so all inequalities that refer to time comparison 
are strict, and can be viewed as follows: time instants are real numbers, and can be equal, but every 
event is associated with a thread; also, in order to obtain a strict order relation, we break ties 
according to the thread numbers (for instance with the relation <). 


2) Notations and Definitions: We recall that $P$ threads are executing the pseudo-code described in
Procedure AbstractAlgorithm, one retry is of unit-size, and the parallel section is of size $pw^{(+)} = q + r$,
where $q$ is a non-negative integer and $r$ is such that $0 \leq r < 1$. Consider a thread $T_n$ which succeeds
at time $S_n$: this thread completes a whole retry in 1 unit of time, then executes the parallel section of
size $pw^{(+)}$, and attempts to perform the operation again every unit of time, until one of the attempts
is successful.


Definition 1. An execution with $P$ threads is called $(f, P)$-cyclic execution if and only if (i) the
execution is periodic, i.e. at every time, every thread is in the same state as one period before, (ii)
the shortest period contains exactly one successful attempt per thread, (iii) the shortest period is
$1 + q + r + f$.

Definition 2. Let $S = (T_i, S_i)_{i \in [\![0, P-1]\!]}$, where $T_i$ are threads and $S_i$ ordered times, i.e. such that
$S_0 < S_1 < \dots < S_{P-1}$. $S$ is a seed if and only if for all $i \in [\![0, P-1]\!]$, $T_i$ does not succeed between
$S_0$ and $S_i$, and starts a retry at $S_i$.

We define $f(S)$ as the smallest non-negative integer such that $S_0 + 1 + q + r + f(S) > S_{P-1} + 1$,
i.e. $f(S) = \max\left(0, \lceil S_{P-1} - S_0 - q - r \rceil\right)$. When $S$ is clear from the context, we denote $f(S)$ by $f$.
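The definition of f(S) can be evaluated directly from the success times of a seed. The sketch below, with the hypothetical name `seed_failures`, takes the ordered times as a list and implements the "smallest non-negative integer" wording literally; this agrees with the ceiling form whenever the quantity under the ceiling is not an integer, which is the case once simultaneous events are excluded as in Remark 2.

```python
import math

def seed_failures(S, q, r):
    """Smallest non-negative integer f such that S[0] + q + r + f > S[-1]."""
    x = S[-1] - S[0] - q - r
    if x < 0:
        return 0                # f = 0 already satisfies the strict inequality
    return math.floor(x) + 1    # smallest integer strictly greater than x
```

For example, with success times `[0, 1.25, 2.5, 3.75]`, `q = 2` and `r = 0.5`, the first thread needs two failed retries before its next attempt starts after the last thread of the seed.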

Definition 3. $S$ is a well-formed seed if and only if for each $i \in [\![0, P-1]\!]$, the execution of thread $T_i$
contains the following sequence: first a success beginning at $S_i$, the parallel section, $f$ unsuccessful
retries, and finally a successful retry.

Those definitions are coupled through the two natural following properties:

Property 1. Given an $(f, P)$-cyclic execution, any seed $S$ including $P$ consecutive successes is a
well-formed seed, with $f(S) = f$.

Proof: Choosing any set of $P$ consecutive successes, we are ensured, by the definition of an $(f, P)$-cyclic
execution, that for each thread, after the first success, the next success will be obtained after
$f$ failures. The order will be preserved, and this shows that a seed including our set of successes is
actually a well-formed seed. ■

Property 2. If there exists a well-formed seed in an execution, then after each thread succeeded once,
the execution coincides with an $(f, P)$-cyclic execution.

Proof: By the definition of a well-formed seed, we know that the threads will first succeed in
order, fail $f$ times, and succeed again in the same order. Considering the second set of successes
in a new well-formed seed, we observe that the threads will succeed a third time in the same order,
after failing $f$ times. By induction, the execution coincides with an $(f, P)$-cyclic execution. ■

Together with the seed concept, we define the notion of gap that we will use extensively in the
next subsection. The general idea of those gaps is that within an $(f, P)$-cyclic execution, the period
is higher than $P \times 1$, which is the total execution time of all the successful retries within the period.
The difference between the period (that lasts $1 + q + r + f$) and $P$, reduced by $r$ (so that we obtain
an integer), is referred to as lagging time in the following. If the threads are numbered according to
their order of success (modulo $P$), as the time elapsed between the successes of two given consecutive
threads is constant (during the next period, this time will remain the same), this lagging time can be
seen in a circular manner (see Figure 5): the threads are represented on a circle whose length is the
lagging time increased by $r$, and the length between two consecutive threads is the time between the
end of the successful retry of the first thread and the beginning of the successful retry of the second one.
More formally, for all $(n, k) \in [\![0, P-1]\!]^2$, we define the $k^{th}$ order gap $G_n^{(k)}$ between $T_n$ and its
$k^{th}$ predecessor based on the gap with the first predecessor:

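$$\begin{cases} \forall n \in [\![1, P-1]\!], & G_n^{(1)} = S_n - S_{n-1} - 1 \\ & G_0^{(1)} = S_0 + q + r + f - S_{P-1} \end{cases}$$

which leads to the definition of higher order gaps:

$$\forall n \in [\![0, P-1]\!],\ \forall k > 0, \quad G_n^{(k)} = \sum_{j=n-k+1}^{n} G_{j \bmod P}^{(1)}$$

For consistency, for all $n \in [\![0, P-1]\!]$, $G_n^{(0)} = 0$.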

Equally, the gaps can be obtained directly from the successes: for all $k \in [\![1, P-1]\!]$,

$$G_n^{(k)} = \begin{cases} S_n - S_{n-k} - k & \text{if } n \geq k \\ S_n - S_{P+n-k} + 1 + q + r + f - k & \text{otherwise} \end{cases} \tag{2}$$
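The two ways of computing a gap can be checked against each other numerically. The sketch below assumes a seed given as a list of ordered success times `S` and an already known `f`; the function names are hypothetical. The first function sums first-order gaps with indices taken modulo P, the second uses the closed form obtained directly from the successes.

```python
def first_order_gaps(S, q, r, f):
    """First-order gaps: index 0 wraps around to the last thread of the seed."""
    P = len(S)
    wrap = S[0] + q + r + f - S[P - 1]
    return [wrap] + [S[n] - S[n - 1] - 1 for n in range(1, P)]

def gap(S, q, r, f, n, k):
    """k-th order gap of thread n as a sum of first-order gaps (indices mod P)."""
    P = len(S)
    g1 = first_order_gaps(S, q, r, f)
    return sum(g1[j % P] for j in range(n - k + 1, n + 1))

def gap_closed_form(S, q, r, f, n, k):
    """Same gap computed directly from the success times, for 1 <= k <= P - 1."""
    P = len(S)
    if n >= k:
        return S[n] - S[n - k] - k
    return S[n] - S[P + n - k] + 1 + q + r + f - k
```

With `S = [0, 1.25, 2.5, 3.75]`, `q = 2`, `r = 0.5` and `f = 2`, both functions agree for every thread and every order from 1 to P - 1, and the sum of the first-order gaps minus `r` equals `1 + q + f - P`.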


Note that, in an $(f, P)$-cyclic execution, the lagging time is the sum of all first order gaps, reduced
by r. 

Now we extend the concept of well-formed seed to weakly-formed seed. 


Definition 4. Let $S = (T_i, S_i)_{i \in [\![0, P-1]\!]}$ be a seed.

$S$ is a weakly-formed seed for $P$ threads if and only if: $(T_i, S_i)_{i \in [\![0, P-2]\!]}$ is a well-formed seed for
$P - 1$ threads, and the first thread succeeding after $T_{P-2}$ is $T_{P-1}$.


Property 3. Let $S = (T_i, S_i)_{i \in [\![0, P-1]\!]}$ be a weakly-formed seed.

Denoting $f = f\left((T_i, S_i)_{i \in [\![0, P-2]\!]}\right)$, for each $n \in [\![0, P-1]\!]$, $G_n^{(f)} < 1$.

Proof: We have $S_{P-2} + 1 < S_{P-1} < R_0^f$; indeed, if we note $\tilde{G}_n^{(k)}$ the gaps within
$(T_i, S_i)_{i \in [\![0, P-2]\!]}$, the previous well-formed seed with $P - 1$ threads, we know that for all $n \in [\![1, P-2]\!]$,
$G_n^{(1)} = \tilde{G}_n^{(1)}$, and $G_{P-1}^{(1)} + G_0^{(1)} = \tilde{G}_0^{(1)}$, which leads to $G_n^{(k)} \leq \tilde{G}_n^{(k)}$, for all $n \in [\![0, P-1]\!]$ and $k$; hence
the weaker property. ■
Figure 6: Lemma configuration


Lemma 1. Let $S$ be a weakly-formed seed, and $f = f\left((T_i, S_i)_{i \in [\![0, P-2]\!]}\right)$. If, for all $n \in [\![0, P-1]\!]$,
$G_n^{(f+1)} < 1$, then there exists later in the execution a well-formed seed $S'$ for $P$ threads such that
$f(S') = f + 1$.

Proof: The proof is straightforward; $S$ is actually a well-formed seed such that $f(S) = f + 1$.
Since $R_0^f - S_{P-1} \leq G_0^{(f+1)} < 1$, the first success of $T_0$ after the success of $T_{P-1}$ is its $(f+1)^{th}$ retry. ■

B. Cyclic Executions

Theorem 1. Given a seed $S = (T_i, S_i)_{i \in [\![0, P-1]\!]}$, $S$ is a well-formed seed if and only if for all
$n \in [\![0, P-1]\!]$, $0 \leq G_n^{(f)} < 1$.
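The gap criterion of Theorem 1 lends itself to a direct numerical check. The following sketch assumes its input already is a seed (it does not verify that), recomputes f from the success times, and tests the condition on the f-th order gap of every thread. The name `is_well_formed` is hypothetical.

```python
import math

def is_well_formed(S, q, r):
    """Gap criterion: 0 <= G_n^(f) < 1 for every thread n of the seed S."""
    P = len(S)
    x = S[-1] - S[0] - q - r
    f = 0 if x < 0 else math.floor(x) + 1    # f(S) as in Definition 2
    # First-order gaps; index 0 wraps around to the last thread of the seed.
    g1 = [S[0] + q + r + f - S[-1]] + [S[n] - S[n - 1] - 1 for n in range(1, P)]
    for n in range(P):
        g = sum(g1[j % P] for j in range(n - f + 1, n + 1))   # G_n^(f)
        if not (0 <= g < 1):
            return False
    return True
```

For two threads with `q = 0`, `r = 0.5` and success times `[0, 1.25]`, both first-order gaps equal 0.25 and `f = 1`, so the criterion holds; following the execution by hand shows each thread alternating one failure and one success with period 2.5. With `[0, 1.25, 2.5, 3.75]`, `q = 2` and `r = 0.5`, one second-order gap reaches 1 and the criterion fails.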

Proof:

Let $S = (T_i, S_i)_{i \in [\![0, P-1]\!]}$ be a seed.

($\Leftarrow$) We assume that for all $n \in [\![0, P-1]\!]$, $0 < G_n^{(f)} < 1$, and we first show that the first successes
occur in the following order: $T_0$ at $S_0$, $T_1$ at $S_1$, ..., $T_{P-1}$ at $S_{P-1}$, $T_0$ again at $R_0^f$. The first threads
that are successful execute their parallel section after their success, then enter their second retry
loop: from this moment, they can make the first attempt of the threads that have not been successful
yet fail. Therefore, we will look at which retry of which already successful threads could have an
impact on which other threads.

We can notice that for all $n \in [\![0, P-1]\!]$, if the first success of $T_n$ occurs at $S_n$, then its next
attempts will potentially occur at $R_n^k = S_n + 1 + q + r + k$, where $k \geq 0$. More specifically, thanks
to Equation 2, for all $n < f$, $R_n^k = S_{P+n-f} + G_n^{(f)} + k$. Also, for all $k < f - n$,

$$R_n^k - S_{P+n-f+k} = G_n^{(f)} - \left(S_{P+n-f+k} - S_{P+n-f} - k\right) = G_n^{(f)} - G_{P+n-f+k}^{(k)} = G_n^{(f-k)} \tag{3}$$

and this implies that if $k > 0$,

$$S_{P+n-f+k} - R_n^{k-1} = 1 - G_n^{(f-k)} \tag{4}$$

We know, by hypothesis, that $0 < G_n^{(f-k)} < 1$, equivalently $0 < 1 - G_n^{(f-k)} < 1$. Therefore
Equation 3 states that if a thread $T_{n'}$ starts a successful attempt at $S_{P+n-f+k}$, then this thread will
make the $k^{th}$ retry of $T_n$ fail, since $T_n$ enters a retry while $T_{n'}$ is in a successful retry. And Equation 4
shows that, given a thread $T_{n'}$ starting a new retry at $S_{P+n-f+k}$, the only retry of $T_n$ that can make
$T_{n'}$ fail on its attempt is the $(k-1)^{th}$ one. There is indeed only one retry of $T_n$ that can enter a
retry before the entrance of $T_{n'}$, and exit the retry after it.

$T_0$ is the first thread to succeed at $S_0$, because no other thread is in the retry loop at this time.
Its next attempt will occur at $R_0^0$, and all thread attempts that start before $S_{P-f}$ (included) cannot
fail because of $T_0$, since it runs then the parallel section. Also, since all gaps are positive, the threads
$T_1$ to $T_{P-f}$ will succeed in this order, respectively starting at times $S_1$ to $S_{P-f}$.
Then, using induction, we can show that $T_{P-f+1}, \ldots, T_{P-1}$ succeed in this order, respectively
starting at times $S_{P-f+1}, \ldots, S_{P-1}$. For $j \in [\![0, f-1]\!]$, let $(\mathcal{P}_j)$ be the following property: for all
$n \in [\![0, P-f+j]\!]$, $T_n$ starts a successful retry at $S_n$. We assume that for a given $j$, $(\mathcal{P}_j)$ is true, and
we show that it implies that $T_{P-f+j+1}$ will succeed at $S_{P-f+j+1}$. The successful attempt of $T_{P-f+j}$
at $S_{P-f+j}$ leads, for all $j' \in [\![0, j]\!]$, to the failure of the $(j-j')^{th}$ retry of $T_{j'}$ (explanation of Equation 3).
But for each $T_{j'}$, this attempt was precisely the one that could have made $T_{P-f+j+1}$ fail on its attempt
at $S_{P-f+j+1}$ (explanation of Equation 4). Given that all threads $T_n$, where $n > P-f+j+1$, do
not start any retry loop before $S_{P-f+j+1}$, $T_{P-f+j+1}$ will succeed at $S_{P-f+j+1}$. By induction, $(\mathcal{P}_j)$ is
true for all $j \in [\![0, f-1]\!]$.

Finally, when $T_{P-1}$ succeeds, it makes the $(f-1-n)^{th}$ retry of $T_n$ fail, for all $n \in [\![0, f-1]\!]$; also
the next potentially successful attempt for $T_n$ is at $R_n^{f-n}$. (Naturally, for all $n \in [\![f, P-1]\!]$, the next
potentially successful attempt for $T_n$ is at $R_n^0$.)

We can observe that for all $n < P$, $j \in [\![0, P-1-n]\!]$, and all $k \geq j$,

$$R_{n+j}^{k-j} - R_n^k = S_{n+j} + k - j - (S_n + k) = G_{n+j}^{(j)} \tag{5}$$

hence for all $n \in [\![1, f]\!]$, $R_n^{f-n} - R_0^f = G_n^{(n)} > 0$.

As we have as well, for all $n \in [\![f+1, P-1]\!]$, $R_n^0 > R_f^0$, we obtain that among all the threads,
the earliest possibly successful attempt is $R_0^f$. Following $T_{P-1}$, $T_0$ is consequently the next successful
thread in its $f^{th}$ retry.

To conclude this part, we can renumber the threads ($T_{n+1}$ becoming now $T_n$ if $n \geq 0$, and $T_0$
becoming $T_{P-1}$), and follow the same line of reasoning. The only difference is the fact that $T_{P-1}$
(according to the new numbering) enters the retry loop $f$ units of time before $S_{P-1}$, but it does not
interfere with the other threads, since we know that those attempts will fail.

There remains the case where there exists $n \in [\![0, P-1]\!]$ such that $G_n^{(f)} = 0$. This implies that
$f = 0$, thus we have a well-formed seed.

($\Rightarrow$) We prove now the implication by contraposition; we assume that there exists $n \in [\![0, P-1]\!]$
such that $G_n^{(f)} > 1$ or $G_n^{(f)} < 0$, and show that $S$ is not a well-formed seed.

We assume first that an $f^{th}$ order gap is negative. As it is a sum of $1^{st}$ order gaps, then there
exists $n'$ such that $G_{n'}^{(1)}$ is negative; let $n''$ be the highest one.

If $n'' > 0$, then either the threads $T_0, \ldots, T_{n''-1}$ succeeded in order at their $0^{th}$ retry, and then
$T_{n''-1}$ makes $T_{n''}$ fail at its $0^{th}$ retry (we have a seed, hence by definition, $S_{n''-1} < S_{n''}$, and $G_{n''}^{(1)} < 0$,
thus $S_{n''-1} < S_{n''} < S_{n''-1} + 1$), or they did not succeed in order at their first try. In both cases, $S$
is not a well-formed seed.

If n" = 0, let us assume that 5 is a well-formed seed. Let also a new seed be S' = {T, POiep P-ip 
where for all n E [[0, P — 2]], and Pg = Pp_i — (g-|- 1 r). Like S, S' is a well-formed 

seed; however, G^^^ is negative, and we fall back into the previous case, which shows that S' is not 
a well-formed seed. This is absurd, hence S is not a well-formed seed. 

We assume now that every gap is positive and choose ng defined by: rig = min{n ; 3k E 
[[0, P — > 1}, and /g = min{A: ; Gl^^_^_|^ > 1}: among the gaps that exceed 1, we pick 

those that concern the earliest thread, and among them the one with the lowest order. 

Let us assume that threads $T_0, \ldots, T_{P-1}$ succeed at their $0^{th}$ retry in this order, then $T_0, \ldots, T_{n_0}$
complete their second successful retry loop at their $f^{th}$ retry, in this order. If this is not the case, then
$S$ is not a well-formed seed, and the proof is completed. According to Equation 5, we have, on the one
hand, = G^^o+v which implies RII+i-I-RII = thus -(i ?4 + 1 ) = G^no+v 

and on the other hand, - R^ = implying - (i?^ + 1) = - 1 . As 

we know that = GG°_^_j^ < 1 by dehnition of /o (and no), we can derive that 

<+.-« + !) ><"4 — {Rfig + !)■ We have assumed that Tno succeeds at its retry, which 
will end at $R_{n_0}^{f} + 1$. The previous inequality states then that $T_{n_0+1}$ cannot be successful at its $f^{th}$
retry, since either a thread succeeds before $T_{n_0+f_0}$ and makes both $T_{n_0+f_0}$ and $T_{n_0+1}$ fail, or $T_{n_0+f_0}$
succeeds and makes $T_{n_0+1}$ fail. We have shown that $S$ is not a well-formed seed. ■

Lemma 2. Assuming $r \neq 0$, if a new thread is added to an $(f, P)$-cyclic execution, it will eventually
succeed.

Proof: 

Let $R_P^0$ be the time of the $0^{th}$ retry of the new thread, that we number $T_P$. If this retry is successful,
we are done; let us assume now that this retry is a failure, and let us shift the thread numbers (for
the threads $T_0, \ldots, T_{P-1}$) so that $T_0$ makes $T_P$ fail on its first attempt. We distinguish two cases,
depending on whether $G_{P-1}^{(P-1)} > R_P^0 - S_0$ or not.

We assume that $G_{P-1}^{(P-1)} > R_P^0 - S_0$. We know that $n \mapsto G_n^{(n)}$ is increasing on $[\![0, P-1]\!]$ and
that $G_0^{(0)} = 0$, hence let $n_0 = \max\{n \in [\![0, P-1]\!] \;;\; G_n^{(n)} < R_P^0 - S_0\}$. For all $k \in [\![0, n_0]\!]$, we have
$R_P^k - S_k = k + R_P^0 - (G_k^{(k)} + S_0 + k) = R_P^0 - S_0 - G_k^{(k)}$, hence $R_P^k - S_k > 0$ and $R_P^k - S_k \leq R_P^0 - S_0 < 1$.
This shows that $T_0, \ldots, T_{n_0}$, because of their successes at $S_0, \ldots, S_{n_0}$, successively make the $0^{th}, \ldots, n_0^{th}$
retries (respectively) of $T_P$ fail. The next attempt for $T_P$ is at $R_P^{n_0+1}$, which fulfills the following
inequality: $R_P^{n_0+1} - (S_{n_0} + 1) < S_{n_0+1} - (S_{n_0} + 1)$ since

$$R_P^{n_0+1} - S_{n_0+1} = (n_0 + 1 + R_P^0) - \left(G_{n_0+1}^{(n_0+1)} + S_0 + n_0 + 1\right) < 0, \qquad R_P^{n_0+1} - (S_{n_0} + 1) > 0.$$

$T_{n_0+1}$ should have been the successful thread, but $T_P$ starts a retry before $S_{n_0+1}$, and is therefore
succeeding.

We consider now the reverse case by assuming that G^f '^ < Rfp — So. With the previous line of 
reasoning, we can show that To, ..., Tp-i, because of their successes at So, ..., Sp_i, successively 
make 0*^, ..., {P — 1)*^ retries (respectively) of Tp fail. Then we are back in the same situation 
when To made Tp fail for the hrst time (To makes Tp fail), except that the success of To starts at 
Sq = So -|- As Gg^^ = g'-|-r-|-/ — P>0 and q, f and P are integers, we have that Gg^^ > r. 
By the way, if we had Gg^^ > r, we would have Gg^^ > 1 -|- r > Rf, — So, which is absurd. So makes 
indeed Rp fail, therefore Gg should be less than 1 . Consequently, we are ensured that Gg — r. 
We dehne 



also, for every A; G [1, /cgj, r < Rfp — (So -|- A: x r) and r > Rp — (So -|- {ko -|- 1) x r): the cycle of 
successes of To, ..., Tp-i is executed ko times. Then the situation is similar to the hrst case, and 
Tp will succeed. 


Lemma 3. Let $S$ be a weakly-formed seed, and $f = f\big((T_i, S_i)_{i \in [\![0, P-2]\!]}\big)$. If $G_f^{(f+1)} > 1$, and if the second success of $T_{P-1}$ does not occur before the second success of $T_{f-1}$, then we can find in the execution a well-formed seed $S'$ for $P$ threads such that $f(S') = f$.

Proof: 
Figure 7: Lemma 3 configuration


Let us first remark that, by the definition of a weakly-formed seed, all threads will succeed once, in order. Then two ordered groups of threads will compete for each of the next successes, until $T_{f-1}$ succeeds for the second time.

Let $e$ be the smallest integer of $[\![f, P-1]\!]$ such that the second success of $T_e$ occurs after the second success of $T_{f-1}$. Let then $\mathcal{S}_1$ and $\mathcal{S}_2$ be the two groups of threads that are in competition, defined by

$$\mathcal{S}_1 = \{T_n \;;\; n \in [\![0, f-1]\!]\}, \qquad \mathcal{S}_2 = \{T_n \;;\; n \in [\![f, e-1]\!]\}.$$


For all n G [0, e — Ij, we note 


rank (n) 


if Tn G 5i 
-1 if G 52 


We define $\sigma$, a permutation of $[\![0, e-1]\!]$ that describes the reordering of the threads during the round of the second successes, such that, for all $(i,j) \in [\![0, e-1]\!]^2$, $\sigma(i) < \sigma(j)$ if and only if $rank(i) < rank(j)$.

We also define a function that will help in expressing the $\sigma^{-1}(k)$'s:

$$m_2 : [\![0, e-1]\!] \to [\![f, e-1]\!], \qquad k \mapsto \max\{\ell \in [\![f, e-1]\!] \;;\; T_\ell \in \mathcal{S}_2, \; \sigma(\ell) \leq k\}.$$


We note that rank\^^ is increasing, as well as rank\y This shows that #{7^ G ^2 ; cr (£) < 
k} = m 2 {k) — (/ — 1). Consequently, if Ta-^{k) £ ^ 2 , then 

m 2 {k) = G 52 ; <k} + f -I 

= #{7^g52; i<a-\k)] + f-I 
= cr“^ (/c) - / + 1 + / - 1 
m 2 {k) = (k). 


Conversely, if Ta-^{k) ^ 5i, among {7^(„) ; n G [0, A:!}, there are exactly m 2 (/c) — / + 1 threads 
in 52, hence 


fj ^ [k) = k + 1 — {m 2 (A:) — / + 1) — 1 = / + A: — m 2 (A:) — 1. 


In both cases, among {7^(n) ; n G [O, A:!}, there are exactly m 2 (A:) — / + 1 threads in 52, and 
mi (k) = k — {m 2 {k) — f) threads in 5i. 

We prove by induction that after this hrst round, the next successes will be, respectively, achieved 
by 7^-i(o), 7^-i(i), ■ ■ ■, 7^-i(e-i)- In the following, by “A:*^ success”, we mean A:**^ success after the 
hrst success of Tp-i, starting from 0, and the Rj’s denote the attempts of the second round. 
Let $(\mathcal{P}_K)$ be the following property: for all $k \leq K$, the $k$-th success is achieved by $T_{\sigma^{-1}(k)}$ at $R_{\sigma^{-1}(k)}^{f+k-\sigma^{-1}(k)}$. We assume $(\mathcal{P}_K)$ true, and we show that the $(K+1)$-th success is achieved by $T_{\sigma^{-1}(K+1)}$ at $R_{\sigma^{-1}(K+1)}^{f+K+1-\sigma^{-1}(K+1)}$.

We hrst show that if Tcr-^{K) £ then 


f+K-a-^(K) 


( 6 ) 


On the one hand, 


f+K-<T-^{K) 


f+K-a-HK) 


= K - a-^ {K) + 

= K-a-^ {K) + Ri + a-^ {K) + 
= K + Sp.i + l + 

= K + Sp.r + l + 


On the other hand. 


f+K-m2(K) 

^m2{K) 


fpK-m2(K) 

^m2{K) 


(m2 {K)-f) + R^ 

(m2 {K) - f) + K - (m2 {K) 
(m2 {K) - f) + K - (m2 {K) 


K-(ni 2 {K)-f) , M7n2{K)-f) 
■T '-’m2{K) 


f) + R) + 

/) + Sp.i + 1 + 


K + Sp—i + 1 + G, 


(m2{K) + l) 
m2{K) 


- 1 . 


1) + G 


{m2{K)-f) 
m2 (if) 


Therefore, 

f+K-a-^{K) _ nmi{K) _ f+K-cr-^{K) _ f+K-'m2{K) 

^a-^(K) ^m2{K) ~ ^a-^{K) ^m2{K) 

- '^CT-^K) y^m2iK) 

K-^k) - Ki%] = (^)) - ("^2 {K)) . 

In a similar way, we can obtain that if Ta-^{K) G ‘^ 2 , then 


rjm2(K) 


f+K-a-HK) 
^ ^0-1 (iC) 


> R 


m2{K) + l 
mi{K) — l * 


(7) 


In addition, we recall that if Ta-^{K) £ ^ 2 , o' ^ (K) = m 2 (K), thus the second inequality of 
Equation becomes an equality, and if 7(j-i (^K) £ ‘5i, a ^ (K) = f + K — m 2 {K) — 1, hence the 
second inequality of Equation becomes an equality. 

Now let us look at which attempt of other threads Ta-^(K) made fail. From now on, and until 
explicitly said otherwise, we assume that %r-^{K) £ ‘^i. According to Equation [6| we have 


pmi(iC)-l 

■^m2(A:) + l 
rjmi{K)-j _ „mi{K)-l 
^m2{K)+j ^m2{K) + l 

^0“i) 

'^m2{K)+j 


> 

< 

< 


f+K-a-HK) 

^cr-'^iK) 

^m2{K)+j ^a-^(K) 

^m2{K)+j 


> 

< 

< 


Tjmi{K) 

^m2{K) 
^m2(K)+j ^m2{K) 

Mj) 

'^m2(K)+j 


This holds for every j G [l,mi {K)J, implying j < /, since there could not be more than / threads 
in 5i. Therefore, as by assumptions gaps of at most order are between 0 and 1, 


0 < R 


'm2(K)+j 


f+K-a-^K) 

-^cr-q/c) 


< 1 ; 
showing that the success of Ta-^(K) makes thread %n 2 {K)+j fail on its attempt at for

j G (K)}. 

Since (K) = mi (K) — 1. Also, for all j G [0, / — 1 — mi (AT)]], 


all 


T^m2(K)-j _ rjf+K-a ^{K) _ jjm2{K)-j _ jjm.2{K) + l 


r>m2{K)-j __ jjf+K-a ^{K) _ ^ 
^m,2{K)+i ^cr-i(lC) “ '-'ni2{K)+j 


_ ( j^m2(K)-j 
'mi(K) 

(j+1) 


= {R:i‘^)V~_[ + {j + i) + G^ 


(j+i) 

'mi{K)+j 


)-( 


-icSd+o+i) 


As a result, Ta--i-(K) makes Tmi{K)+j fail on its attempt at for all j G [0, / — 1 — mi (AT)]], 

and the next attempt will occur at 

Altogether, the next attempt after the end of the success of Ta-^(K) for Tmi{K)+j is 

for j G [O, / - 1 - mi {K)j, and for %n 2 (K)+j is RZI{k)+]^^^ all j G p,mi {K)j. 

Additionally, a thread will begin a new retry loop, the 0**^ retry being at RZ 2 {K)+ 7 ni{K)+i ~ 
We note that / + A' + 1 could be higher than P — 1, referring to a thread whose number 
is more than P — 1. Actually, if n > P — 1, Pf refers to the retry of 7^anfc(n-p+i)) after its hrst 
two successes. 

The two heads, i.e. the two smallest indices, of 5i n cr“^ ([[A +1, e — Ij) and ^2 n 
0 "“^ ([[A + 1, e — Ij) will then compete for being successful. Indeed, within 5i, for j G 
lO, / - 1 - mi (A)]], 


^m2{K)-j + l _ rjm2(K) + l _ ^{j) „ 

thus if someone succeeds in 5i, it will be Trni{K)- la the same way, for all j G p,mi (A) + Ij, 

Tjmi{K)-j + l _ Tj7ni{K) _ ^(j-1) ^ 

^m2(K)+j ^m2(K) + l m2{K)+j ’ 

meaning that if someone succeeds in ^ 2 , it will be Tm 2 {K)+i- 
Let us compare now those two candidates: 


p: 


.m2{K) + l 
'mi(K) 


- )+. = >"2 (A-) + 1 - / + Sp_, + m, (A) + 

- pi (A') + + 1712 (A') + 1 - / + 




= Pp_i -l + G, 


- {Sp 


(mi(K) + l) 
mi{K) 


-l + G 


(/+!) 


-l + G 


_ Q{mi{K) + l) 


mi(K) 


-{ 


g: 


(m2(K)+2) 
m2{K) + l 


{m 2 {K) + l-f) 
m2{K) + l 


jjm2(K) + l ^mi{K) 

it / T^\ - IX 

mi(K) 


= rank (mi (A)) — rank (m 2 (A) + 1) . 


By dehnition, (A + 1) is either mi (A) or m 2 (A) + 1 and corresponds to the next successful 
thread. We can follow the same line of reasoning in the case where Ta-'^(K) ^ ^2 and prove in this 
way that (Vk+i) is true. 

(Po) is true, and the property spreads until (Ve-i), where all threads of 5i and S 2 have been 
successful, in the order ruled by i.e. 7(j-i(o), ■■■, Ta-i{e-i)- And before those successes the 
threads Te-i =7((-i(e-i), • • •, Tp-i have been successful as well. The seed composed of those successes 
is a well-formed seed. Given a thread, the gap between this thread and the next one in the new order 
could indeed not be higher than the gap in the previous order with its next thread. Also the 
order gaps remain smaller than 1. And as Te-i succeeds the second time after / failures, it means 
that the new seed S" is such that / {S") = /. 
Figure 8: Lemma 4 configuration


Lemma 4. Let $S$ be a weakly-formed seed, and $f = f\big((T_i, S_i)_{i \in [\![0, P-2]\!]}\big)$. If $G_f^{(f+1)} > 1$ and if the second success of $T_{P-1}$ occurs before the second success of $T_{f-1}$, then we can find in the execution a well-formed seed $S'$ for $P$ threads such that $f(S') = f$.

Proof: Until the second success of $T_{P-1}$, the execution follows the same pattern as in Lemma 3. Actually, the case invoked in the current lemma could have been handled in the previous lemma, but it would have implied tricky notations, when we referred to $T_{rank(n-P+1)}$. Let us deal with this case independently then, and come back to the instant where $T_{P-1}$ succeeds for the second time.

We had 0 < — Sp-i = < 1. For the thread Ta{j) to succeed at its retry after 

the hrst success of Tp-i and before 7/-i, it should necessary hll the following condition: j + 1 < 
^ Sp-i < J + 1 + This holds also for the second success of Tp-i, which implies that 

P' < Sp-i + l + g + r + /i — Sp-i < P' + where h is the number of failures of Tp-i before its 

( f) 

second success and P' is the number of successes between the two successes of Tp-i- As < 1, 

( f ) 

and q, P' and h are non-negative integers, we have r < and h = P' — 1 — q. 

To conclude, as any gap at any order is less than the gap between the two successes of Tp-i, 
which is r < 1, we found a well-formed seed for P' threads. 

Finally any other thread will eventually succeed (see Lemma|^. We can renumber the threads such 
that Tp' is the first thread that is not in the well-formed seed to succeed, and the threads of the well- 
formed seed succeeded previously as To, ..., Tp'-i. As explained before, for all (k, n) E [O, P' — 1]]^, 
Gn' < Gn ' = r. With the new thread, the first order gaps are changed by decomposing Gg into 
Gp} and the new Gq^\ All gaps can only be decreased, hence we have a new well-formed seed for 
P' -|- 1 threads. We repeat the process until all threads have been encountered, and obtain in the 
end S', a well-formed seed with P threads such that / (S') = P — 1 — q, which is an optimal cyclic 
execution. 

Still, as 7/ succeeds between two successes of 7p-i that are separated by r, we had, in the initial 
conhguration: Gp_-^ " < r. As, in addition, we have both Gy_U < 1 and Gj < 1, we conclude that 
the lagging time was initially less than 2-\-r. By hypothesis, we know that > 1, which implies 

that, before the entry of the new thread, the lagging time was 1 -|- r. In the final execution with one 
more thread, the lagging time is r and we have one more success in the cycle, thus / (5^ = /. ■ 

Theorem 2. Assuming $r \neq 0$, if a new thread is added to an $(f, P-1)$-cyclic execution, then all the threads will eventually form either an $(f,P)$-cyclic execution, or an $(f+1, P)$-cyclic execution.

Proof: According to Lemma 2, the new thread will eventually succeed. In addition, we recall that, by the Properties established earlier, before the first success of the new thread, any set of $P-1$ consecutive successes is a well-formed seed with $P-1$ threads. We then consider a seed (we number the threads accordingly, and number the new thread as $T_{P-1}$) such that the success of the new thread occurs between the success of $T_{P-2}$ and $T_0$; we obtain in this way a weakly-formed seed $S = (T_n, S_n)_{n \in [\![0, P-1]\!]}$. We differentiate between two cases.

Firstly, if for all n E [0, P — Ij, < 1, according to Lemma we can hnd later in the 

execution a well-formed seed S ' for P threads such that / (5^) = f + 1, hence we reach eventually 
an (/ -I- 1, P)-cyclic execution. 

Let us assume now that this condition is not fulhlled. There exists uq E [[0, P — Ij such that 
> 1. We shift the thread numbers, such that uq is now /, and we have then > 1. 

Then two cases are feasible. If the second success of $T_{P-1}$ occurs before the second success of $T_{f-1}$, then Lemma 4 shows that we will reach an $(f, P)$-cyclic execution. Otherwise, from Lemma 3, we conclude that an $(f, P)$-cyclic execution will still occur. ■


C. Throughput Bounds 

Firstly we calculate the expression of throughput and the expected number of threads inside the 
retry loop (that is needed when we gather expansion and wasted retries). Then we exhibit upper 
and lower bounds on both throughput and the number of failures, and show that those bounds are 
reached. Finally, we give the worst case on the number of wasted retries. 


Lemma 5. In an $(f,P)$-cyclic execution, the throughput is

$$T = \frac{P}{q + r + 1 + f}.$$

Proof: By definition, the execution is periodic, and the period lasts $q + r + 1 + f$ units of time. As $P$ successes occur during this period, we end up with the claimed expression. ■


Lemma 6. In an $(f, P)$-cyclic execution, the average number of threads $P_{rl}$ in the retry loop is given by

$$P_{rl} = P \times \frac{f + 1}{q + r + f + 1}.$$

Proof: Within a period, each thread spends $f + 1$ units of time in the retry loop, among the $q + r + f + 1$ units of time of the period, hence the Lemma. ■
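The two closed forms of Lemmas 5 and 6 can be evaluated directly. The following is a minimal sketch, assuming that time is expressed in units of one retry (so that the parallel section lasts $q + r$ units); the function names are illustrative and do not come from the paper.

```python
def cyclic_throughput(P, q, r, f):
    """Throughput of an (f, P)-cyclic execution (Lemma 5).

    Assumption: time is measured in units of one retry, so the result is a
    number of successes per retry duration.
    """
    period = q + r + 1 + f          # length of one period of the cyclic execution
    return P / period               # P successes occur in each period


def threads_in_retry_loop(P, q, r, f):
    """Average number of threads inside the retry loop (Lemma 6)."""
    # Each thread spends f + 1 units out of q + r + f + 1 in the retry loop.
    return P * (f + 1) / (q + r + f + 1)


if __name__ == "__main__":
    P, q, r, f = 8, 3, 0.5, 4
    print(cyclic_throughput(P, q, r, f))
    print(threads_in_retry_loop(P, q, r, f))
```

Note that the two quantities are linked by $P_{rl} = T \times (f + 1)$, which gives a quick sanity check on any pair of computed values.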


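Lemma 7. The number of failures is not less than $f^{(-)}$, where

$$f^{(-)} = \begin{cases} P - q - 1 & \text{if } q \leq P - 1 \\ 0 & \text{otherwise,} \end{cases}$$

and accordingly,

$$T \leq \begin{cases} \dfrac{P}{P + r} & \text{if } q \leq P - 1 \\[2mm] \dfrac{P}{q + r + 1} & \text{otherwise.} \end{cases} \qquad (9)$$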

Proof: According to Lemma 5, the throughput is maximized when the number of failures is minimized. In addition, we have two lower bounds on the number of failures: (i) $f \geq 0$, and (ii) $P$ successes should fit within a period, hence $q + 1 + f \geq P$. Therefore, if $P - 1 - q < 0$, $T \leq P/(q + r + 1 + 0)$; otherwise,

$$T \leq \frac{P}{q + r + 1 + P - 1 - q} = \frac{P}{P + r}.$$

Remark 3. We notice that if $q > P - 1$, the upper bound in Equation 9 is actually the same as the immediate upper bound described in Section III-B1. However, if $q \leq P - 1$, Equation 9 refines the immediate upper bound.
Lemma 8. The number of failures is bounded by

$$f \leq f^{(+)} = \left\lfloor \frac{1}{2}\left((P - 1 - q - r) + \sqrt{(P - 1 - q - r)^2 + 4P}\right)\right\rfloor,$$

and accordingly, the throughput is bounded by

$$T \geq \frac{P}{q + r + 1 + f^{(+)}}.$$

Proof: We show that a necessary condition so that an $(f, P)$-cyclic execution, whose lagging time is $\ell$, exists, is $f \times (\ell + r) < P$. As established by an earlier Property, any set of $P$ consecutive successes is a well-formed seed with $P$ threads. Let $S$ be any of them. As we have $f$ failures before success, Theorem 1 ensures that for all $n \in [\![0, P-1]\!]$, $G_n^{(f)} < 1$. We recall that for all $n \in [\![0, P-1]\!]$, we also have $G_n^{(P)} = \ell + r$.

On the one hand, we have

$$\sum_{n=0}^{P-1} G_n^{(f)} = \sum_{n=0}^{P-1} \; \sum_{j=n-f+1}^{n} G_{j \bmod P}^{(1)} = f \times \sum_{n=0}^{P-1} G_n^{(1)} = f \times (\ell + r).$$

On the other hand, $\sum_{n=0}^{P-1} G_n^{(f)} < \sum_{n=0}^{P-1} 1 = P$.

Altogether, the necessary condition states that $f \times (\ell + r) < P$, which can be rewritten as $f \times (q + 1 + f - P + r) < P$. The proof is complete since minimizing the throughput is equivalent to maximizing the number of failures. ■

Lemma 9. For each of the bounds defined in Lemmas 7 and 8, there exists an $(f, P)$-cyclic execution that reaches the bound.

Proof: According to Lemmas 7 and 8, if an $(f, P)$-cyclic execution exists, then the number of failures is such that $f^{(-)} \leq f \leq f^{(+)}$.

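We show now that this double necessary condition is also sufficient. We consider $f$ such that $f^{(-)} \leq f \leq f^{(+)}$, and build a well-formed seed $S = (T_i, S_i)_{i \in [\![0, P-1]\!]}$. For all $n \in [\![0, P-1]\!]$, we define $S_n$ as

$$S_n = n \times \left(\frac{q + 1 + f - P + r}{P} + 1\right).$$

We first show that $f(S) = f$. By definition, $f(S) = \max\left(0, \lceil S_{P-1} - S_0 - q - r \rceil\right)$; we have then

$$\begin{aligned}
f(S) &= \max\left(0, \left\lceil (P-1) \times \left(\frac{q + 1 + f - P + r}{P} + 1\right) - q - r \right\rceil\right) \\
&= \max\left(0, \left\lceil (P - 1 - q - r) + (q + 1 + f - P + r) - \frac{q + 1 + f - P + r}{P} \right\rceil\right) \\
f(S) &= \max\left(0, \left\lceil f - \frac{q + 1 + f - P + r}{P} \right\rceil\right).
\end{aligned}$$

Firstly, we know that $q + 1 + f - P \geq 0$, thus if $f = 0$, then the second term of the maximum is not positive, and $f(S) = 0 = f$. Secondly, if $f > 0$, then according to Lemma 8, $(q + 1 + f - P + r)/P < 1/f \leq 1$. As we also have $(q + 1 + f - P + r)/P \geq 0$, we conclude that

$$f(S) = \left\lceil f - \frac{q + 1 + f - P + r}{P} \right\rceil = f.$$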
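Additionally, for all $n \in [\![0, P-1]\!]$,

$$\begin{aligned}
G_n^{(f)} &= \begin{cases} S_n - S_{n-f} - f & \text{if } n \geq f \\ S_n - S_{P+n-f} + 1 + q + r & \text{otherwise} \end{cases} \\
&= \begin{cases} n \times \left(\frac{q+1+f-P+r}{P} + 1\right) - (n - f) \times \left(\frac{q+1+f-P+r}{P} + 1\right) - f \\ n \times \left(\frac{q+1+f-P+r}{P} + 1\right) - (P + n - f) \times \left(\frac{q+1+f-P+r}{P} + 1\right) + 1 + q + r \end{cases} \\
&= \begin{cases} f \times \frac{q+1+f-P+r}{P} \\ -(P - f) - (q + 1 + f - P + r) + f \times \frac{q+1+f-P+r}{P} + 1 + q + r \end{cases} \\
G_n^{(f)} &= f \times \frac{\ell + r}{P}, \qquad \text{where } \ell = q + 1 + f - P.
\end{aligned}$$

As $r \geq 0$, $\ell \geq 0$ and $f \geq 0$, $G_n^{(f)} \geq 0$. Since $f \leq f^{(+)}$, $G_n^{(f)} < 1$. Theorem 1 implies that $S$ is a well-formed seed that leads to an $(f, P)$-cyclic execution.

We have shown that for all $f$ such that $f^{(-)} \leq f \leq f^{(+)}$, there exists an $(f, P)$-cyclic execution; in particular there exist an $(f^{(+)}, P)$-cyclic execution and an $(f^{(-)}, P)$-cyclic execution. ■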
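The bounds of Lemmas 7 and 8, and the gap of the seed built in the proof of Lemma 9, are simple enough to check numerically. The sketch below assumes the closed forms $f^{(-)} = \max(0, P - q - 1)$ and $f^{(+)} = \lfloor \frac{1}{2}((P-1-q-r) + \sqrt{(P-1-q-r)^2 + 4P}) \rfloor$; the function names are illustrative only.

```python
import math


def failure_bounds(P, q, r):
    """Return (f_minus, f_plus), the bounds on the number of failures."""
    f_minus = P - q - 1 if q <= P - 1 else 0
    a = q + 1 - P + r                         # a = -(P - 1 - q - r)
    f_plus = math.floor((-a + math.sqrt(a * a + 4 * P)) / 2)
    return f_minus, f_plus


def throughput_bounds(P, q, r):
    """Return (lower, upper) bounds on the throughput, via Lemma 5."""
    f_minus, f_plus = failure_bounds(P, q, r)
    upper = P / (q + r + 1 + f_minus)         # fewest failures, highest throughput
    lower = P / (q + r + 1 + f_plus)          # most failures, lowest throughput
    return lower, upper


def seed_gap(P, q, r, f):
    """Gap of order f of the evenly spaced seed used in the proof of Lemma 9."""
    return f * (q + 1 + f - P + r) / P


if __name__ == "__main__":
    P, q, r = 8, 3, 0.5
    f_minus, f_plus = failure_bounds(P, q, r)
    print(f_minus, f_plus, throughput_bounds(P, q, r))
    # Every admissible f yields a gap in [0, 1); f_plus + 1 does not.
    print([seed_gap(P, q, r, f) for f in range(f_minus, f_plus + 2)])
```

For $P = 8$, $q = 3$, $r = 0.5$, the admissible range is $f \in \{4, 5\}$, and the gap exceeds 1 as soon as $f = 6$, which matches the necessary condition $f \times (\ell + r) < P$.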


Corollary 1. The highest possible number of wasted repetitions is $\lceil \sqrt{P} - 1 \rceil$ and is achieved when $P = q + 1$.
Proof:

The highest possible number of wasted repetitions $w(P)$ with $P$ threads is given by

$$w(P) = f^{(+)} - f^{(-)} = \left\lfloor \frac{1}{2}\left(-a(P) + \sqrt{a(P)^2 + 4P}\right) \right\rfloor - f^{(-)}.$$

Let $a$ and $h$ be the functions respectively defined as $a(P) = q + 1 - P + r$, which implies $a'(P) = -1$, and $h(P) = \left(-a(P) + \sqrt{a(P)^2 + 4P}\right)/2 - f^{(-)}$, so that $w(P) = \lfloor h(P) \rfloor$.

Let us first assume that $a(P) > 0$. In this case, $q \geq P - 1$, hence $f^{(-)} = 0$. We have

$$\begin{aligned}
2h'(P) &= 1 + \frac{-2a(P) + 4}{2\sqrt{a(P)^2 + 4P}} \\
2h'(P) &= 2 \times \frac{2 - a(P) + \sqrt{a(P)^2 + 4P}}{2\sqrt{a(P)^2 + 4P}}.
\end{aligned}$$

Therefore, $h'(P)$ is negative if and only if $\sqrt{a(P)^2 + 4P} < a(P) - 2$. It cannot be true if $a(P) < 2$. If $a(P) \geq 2$, then the previous inequality is equivalent to $a(P)^2 + 4P < a(P)^2 - 4a(P) + 4$, which can be rewritten in $q + 1 + r < 1$, which is absurd. We have shown that $h$ is increasing in $]0, q+1]$.

Let us now assume that $a(P) < 0$. In this case, $q < P - 1$, hence $f^{(-)} = P - q - 1$, and $h(P) = \left(a(P) + \sqrt{a(P)^2 + 4P}\right)/2 - r$. Assuming $h'(P)$ to be positive leads to the same absurd inequality $q + 1 + r < 1$, which proves that $h$ is decreasing on $[q+2, +\infty[$.

Also, the maximum number of wasted repetitions is achieved as $P = q + 1$ or $P = q + 2$. Since

$$h(q+1) = \frac{1}{2}\left(-r + \sqrt{r^2 + 4(q+1)}\right) > \frac{1}{2}\left(-(r+1) + \sqrt{(1-r)^2 + 4(q+2)}\right) = h(q+2),$$

the maximum number of wasted repetitions is $w(q+1)$. In addition,


since $0 < r < 1$ and $P = q + 1$,

$$\frac{1}{2}\left(-1 + \sqrt{4P}\right) < h(q+1) < \frac{1}{2}\sqrt{4P}, \qquad \text{hence} \qquad \sqrt{P} - 1 < h(q+1) < \sqrt{P}.$$

We conclude that the maximum number of wasted repetitions is $\lceil \sqrt{P} - 1 \rceil$.
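The behavior described in Corollary 1 can be observed by scanning the number of threads for fixed $q$ and $r$. The following sketch assumes $w(P) = f^{(+)} - f^{(-)}$ with the closed forms of Lemmas 7 and 8, and $0 < r < 1$; the helper names are illustrative.

```python
import math


def wasted_retries(P, q, r):
    """w(P) = f_plus - f_minus, the spread between the two failure bounds."""
    a = q + 1 - P + r
    f_minus = P - q - 1 if q <= P - 1 else 0
    f_plus = math.floor((-a + math.sqrt(a * a + 4 * P)) / 2)
    return f_plus - f_minus


def scan_wasted_retries(q, r, max_threads):
    """Return {P: w(P)} for P = 1..max_threads."""
    return {P: wasted_retries(P, q, r) for P in range(1, max_threads + 1)}


if __name__ == "__main__":
    q, r = 15, 0.5
    table = scan_wasted_retries(q, r, 40)
    # The largest value is reached at P = q + 1 (possibly tied with neighbors,
    # because of the floor in f_plus).
    print(table[q + 1], max(table.values()))
```

Because of the floor, several values of $P$ around $q + 1$ may share the maximum; the scan only confirms that no other $P$ exceeds $w(q+1)$.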


V. Expansion and Complete Throughput Estimation 


A. Expansion 


Interference of threads does not only lead to logical conflicts but also to hardware conflicts which
impact the performance significantly. We model the behavior of the cache coherency protocols which
determine the interaction of overlapping Reads and CASs. By taking MESIF [GH09] as basis, we
come up with the following assumptions. When executing an atomic CAS, the core gets the cache line 
in exclusive state and does not forward it to any other requesting core until the instruction is retired. 
Therefore, requests stall for the release of the cache line which implies serialization. On the other 
hand, ongoing Reads can overlap with other operations. As a result, a CAS introduces expansion 
only to overlapping Read and CAS operations that start after it, as illustrated in Figure As a 
remark, we ignore memory bandwidth issues which are negligible for our study. 

Furthermore, we assume that Reads that are executed just after a CAS do not experience expansion
(as the thread already owns the data), which takes effect at the beginning of a retry following a
failing attempt. Thus, read expansions need only to be considered before the retry. In this sense,
read expansion can be moved to the parallel section and calculated in the same way as CAS expansion
is calculated.

To estimate expansion, we consider the delay that a thread can introduce, provided that there is 
already a given number of threads in the retry loop. The starting point of each CAS is a random 
variable which is distributed uniformly within an expanded retry. The cost function d provides the 
amount of delay that the additional thread introduces, depending on the point where the starting 
point of its CAS hits. By using this cost function we can formulate the expansion increase that each 
new thread introduces and derive the differential equation below to calculate the expansion of a 
CAS 

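Lemma 10. The expansion of a CAS operation is the solution of the following system of equations:

$$\begin{cases} e'(P_{rl}) = cc \times \dfrac{\frac{cc}{2} + e(P_{rl})}{rc + cw + cc + e(P_{rl})} \\[3mm] e\left(P_{rl}^{(0)}\right) = 0, \end{cases}$$

where $P_{rl}^{(0)}$ is the point where expansion begins.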


Proof:

We compute $e(P_{rl} + h)$, where $h \leq 1$, by assuming that there are already $P_{rl}$ threads in the retry loop, and that a new thread attempts to CAS during the retry, within a probability $h$. Keeping only the intervals of the expanded retry where the delay $d(t)$ is non-zero, we have

$$\begin{aligned}
e(P_{rl} + h) &= e(P_{rl}) + h \times \int_0^{rlw^{(+)}} \frac{d(t)}{rlw^{(+)}} \, dt \\
&= e(P_{rl}) + h \times \left( \int_{rc+cw-cc}^{rc+cw} \frac{d(t)}{rlw^{(+)}} \, dt + \int_{rc+cw}^{rc+cw+e(P_{rl})} \frac{cc}{rlw^{(+)}} \, dt \right) \\
e(P_{rl} + h) &= e(P_{rl}) + h \times \frac{\frac{cc^2}{2} + e(P_{rl}) \times cc}{rlw^{(+)}}.
\end{aligned}$$

This leads to $\dfrac{e(P_{rl} + h) - e(P_{rl})}{h} = \dfrac{\frac{cc^2}{2} + e(P_{rl}) \times cc}{rlw^{(+)}}$. When making $h$ tend to 0, we finally obtain

$$e'(P_{rl}) = cc \times \frac{\frac{cc}{2} + e(P_{rl})}{rc + cw + cc + e(P_{rl})}.$$
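The differential equation has no convenient closed form, so it is solved numerically in practice. The following is a minimal sketch using a fixed-step fourth-order Runge-Kutta scheme. It assumes that the right-hand side is $cc \times (cc/2 + e) / (rc + cw + cc + e)$ and that the expansion is zero up to the starting point `p_rl0`; the function name, the parameter names and the sample latencies are illustrative, not values from the paper.

```python
def expansion(p_rl, p_rl0, rc, cw, cc, steps=1000):
    """Integrate e'(x) = cc * (cc/2 + e) / (rc + cw + cc + e), with e(p_rl0) = 0.

    p_rl  : number of threads in the retry loop at which e is evaluated
    p_rl0 : point where the expansion begins (assumed given)
    rc, cw, cc : Read latency, critical work, CAS latency (same time unit)
    """
    if p_rl <= p_rl0:
        return 0.0                      # no expansion before the starting point

    def slope(e):
        return cc * (cc / 2 + e) / (rc + cw + cc + e)

    h = (p_rl - p_rl0) / steps
    e = 0.0
    for _ in range(steps):
        # Classical RK4 step; the right-hand side depends on e only.
        k1 = slope(e)
        k2 = slope(e + h * k1 / 2)
        k3 = slope(e + h * k2 / 2)
        k4 = slope(e + h * k3)
        e += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6
    return e


if __name__ == "__main__":
    for threads in (1, 2, 4, 8):
        print(threads, expansion(threads, p_rl0=1, rc=10, cw=20, cc=5))
```

A library solver would work equally well; a hand-written integrator is used here only to keep the example free of dependencies.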


B. Throughput Estimate 

There remains to combine hardware and logical conflicts in order to obtain the final upper and lower bounds on throughput. We are given as an input an expected number of threads $P_{rl}$ inside the retry loop. We firstly compute the expansion accordingly, by solving numerically the differential equation of Lemma 10. As explained in the previous subsection, we have $pw^{(+)} = pw + e$, and $rlw^{(+)} = rc + cw + e + cc$. We can then compute $q$ and $r$, that are the inputs (together with the total number of threads $P$) of the method described in Section IV. Assuming that the initialization times of the threads are spaced enough, the execution will superimpose an $(f, P)$-cyclic execution. Thanks to Lemma 6, we can compute the average number of threads inside the retry loop, that we note by $h_f(P_{rl})$. A posteriori, the solution is consistent if this average number of threads inside the retry loop $h_f(P_{rl})$ is equal to the expected number of threads $P_{rl}$ that has been given as an input.

Several $(f, P)$-cyclic executions belong to the domain of the possible outcomes, but we are interested in upper and lower bounds on the number of failures $f$. We can compute them through Lemmas 7 and 8, along with their corresponding throughput and average number of threads inside the retry loop. We note by $h^{(+)}(P_{rl})$ and $h^{(-)}(P_{rl})$ the average number of threads for the lowest number of failures and highest one, respectively. Our aim is finally to find $P_{rl}^{(-)}$ and $P_{rl}^{(+)}$, such that $h^{(+)}(P_{rl}^{(+)}) = P_{rl}^{(+)}$ and $h^{(-)}(P_{rl}^{(-)}) = P_{rl}^{(-)}$. If several solutions exist, then we want to keep the smallest, since the retry loop stops expanding when a stable state is reached.

Note that we also need to provide the point where the expansion begins. It begins when we start to 
have failures, while reducing the parallel section. Thus this point is (2P—(resp. {P—l)rlw^~^) 
for the lower (resp. upper) bound on the throughput. 
Theorem 3. Let $(x_n)$ be the sequence defined recursively by $x_0 = 0$ and $x_{n+1} = h^{(+)}(x_n)$. If $pw \geq rc + cw + cc$, then

$$P_{rl}^{(+)} = \lim_{n \to +\infty} x_n.$$

Proof: First of all, the average number of threads belongs to $]0, P[$, thus for all $x \in [0, P]$, $0 < h^{(+)}(x) < P$. In particular, we have $h^{(+)}(0) > 0$, and $h^{(+)}(P) < P$, which proves that there exists one fixed point for $h^{(+)}$.

In addition, we show that $h^{(+)}$ is a non-decreasing function. According to Lemma 6,

$$h^{(+)}(P_{rl}) = P \times \frac{1 + f^{(-)}}{q + r + f^{(-)} + 1},$$

where all variables except $P$ depend actually on $P_{rl}$. We have

$$q = \left\lfloor \frac{pw + e}{rlw^{(-)} + e} \right\rfloor \quad \text{and} \quad r = \frac{pw + e}{rlw^{(-)} + e} - q,$$

hence, if $pw \geq rlw^{(-)}$, $q$ and $r$ are non-increasing as $e$ is non-decreasing, and $e$ is itself non-decreasing with $P_{rl}$. Since $f^{(-)}$ is non-decreasing as $q$ decreases, we have shown that if $pw \geq rlw^{(-)}$, $h^{(+)}$ is a non-decreasing function.

Finally, the proof is completed by the theorem of Knaster-Tarski. ■

The same line of reasoning holds for $h^{(-)}$ as well. As a remark, we point out that when $pw < rlw^{(-)}$, we scan the interval of solution, and have no guarantees about the fact that the solution is the smallest one; still it corresponds to very extreme cases.
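The iteration $x_{n+1} = h(x_n)$ starting from $x_0 = 0$ can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: $q$ and $r$ are taken as the integer and fractional parts of $(pw + e)/(rlw + e)$, where `rlw` is the retry size without expansion, and `expansion_of` is a hypothetical callable returning the expansion for a given number of threads in the retry loop (for instance a numerical solution of the equation of Lemma 10).

```python
import math


def average_threads(P, pw, rlw, e, lowest_failures=True):
    """h(P_rl): Lemma 6 applied to one of the two bounds on f, for expansion e."""
    ratio = (pw + e) / (rlw + e)
    q = math.floor(ratio)
    r = ratio - q
    if lowest_failures:                       # f_minus (Lemma 7)
        f = P - q - 1 if q <= P - 1 else 0
    else:                                     # f_plus (Lemma 8)
        a = q + 1 - P + r
        f = math.floor((-a + math.sqrt(a * a + 4 * P)) / 2)
    return P * (f + 1) / (q + r + f + 1)


def fixed_point(P, pw, rlw, expansion_of, lowest_failures=True,
                tol=1e-9, max_iter=10_000):
    """Iterate x <- h(x) from x = 0 until two successive values agree."""
    x = 0.0
    for _ in range(max_iter):
        nxt = average_threads(P, pw, rlw, expansion_of(x), lowest_failures)
        if abs(nxt - x) < tol:
            return nxt
        x = nxt
    return x                                  # best effort if not converged


if __name__ == "__main__":
    # Toy expansion model, for illustration only: linear in P_rl.
    print(fixed_point(8, pw=50, rlw=10, expansion_of=lambda x: 0.5 * x))
```

Starting from 0 matters: when the map is non-decreasing, the iterates increase towards the smallest fixed point, which is the one the analysis wants to keep.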


C. Several Retry Loops 

We consider here a lock-free algorithm that, instead of being a loop over one parallel section and 
one retry loop, is composed of a loop over a sequence of alternating parallel sections and retry loops. 
We show that this algorithm is equivalent to an algorithm with only one parallel section and one 
retry loop, by proving the intuition that the longest retry loop is the only one that fails and hence 
expands. 

1) Problem Formulation: In this subsection, we consider an execution such that each spawned thread runs Procedure Combined in Figure 9. Each thread executes a linear combination of $S$ independent retry loops, i.e. operating on separate variables, interleaved with parallel sections. We note now as $rlw_i^{(+)}$ and $pw_i^{(+)}$ the size of a retry of the $i$-th retry loop and the size of the $i$-th parallel section, respectively, for each $i \in [\![1, S]\!]$. As previously, $q_i$ and $r_i$ are defined such that $pw_i^{(+)} = (q_i + r_i) \times rlw_i^{(+)}$, where $q_i$ is a non-negative integer and $r_i$ is smaller than 1.


Procedure Combined

    Initialization();
    while !done do
        for i ← 1 to S do
            Parallel_Work(i);
            while !success do
                current ← Read(AP[i]);
                new ← Critical_Work(i, current);
                success ← CAS(AP[i], current, new);


Figure 9: Thread procedure with several retry loops
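To make the control flow of Procedure Combined concrete, here is a small runnable sketch. Python has no hardware compare-and-swap, so the CAS is emulated with a lock inside a hypothetical `AccessPoint` class; the critical work is replaced by a simple increment, and the parallel work is omitted. None of these names come from the paper.

```python
import threading


class AccessPoint:
    """Shared cell with an emulated compare-and-swap (lock-based stand-in)."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def read(self):
        return self._value

    def cas(self, expected, new):
        # Atomically replace the value only if it still equals `expected`.
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False


def combined(access_points, rounds):
    """One thread: loop over the S retry loops, `rounds` times."""
    for _ in range(rounds):
        for ap in access_points:
            # Parallel_Work(i) would run here.
            success = False
            while not success:              # retry loop on access point i
                current = ap.read()
                new = current + 1           # stand-in for Critical_Work
                success = ap.cas(current, new)


def run(num_threads=4, num_loops=3, rounds=1000):
    aps = [AccessPoint() for _ in range(num_loops)]
    threads = [threading.Thread(target=combined, args=(aps, rounds))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [ap.read() for ap in aps]


if __name__ == "__main__":
    print(run())
```

Each successful CAS adds exactly one to its access point, so every cell ends at `num_threads * rounds` whatever the interleaving; failed attempts only cost retries.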
The Procedure Combined executes the retry loops and parallel sections in a cyclic fashion, so we can normalize the writing of this procedure by assuming that a retry of the 1st retry loop is the longest one. More precisely, we consider the initial algorithm, and we define $i_0$ as

$$i_0 = \min \operatorname{argmax}_{i \in [\![1, S]\!]} \; rlw_i^{(+)}.$$

We then renumber the retry loops such that the new ordering is $i_0, \ldots, S, 1, \ldots, i_0 - 1$, and we add in Initialization the first parallel sections and retry loops on access points from 1 to $i_0 - 1$, according to the initial ordering.

One success at the system level is defined as one success of the last CAS, and the throughput is defined accordingly. We note that in steady-state, all retry loops have the same throughput, so the throughput can be computed from the throughput of the 1st retry loop instead.

2) Wasted Retries: 

Lemma 11. Unsuccessful retry loops can only occur in the 1st retry loop.

Proof:

We note $(t_n)_{n \in [\![1, +\infty[\![}$ the sequence of the thread numbers that succeed in the 1st retry loop, and $(s_n)_{n \in [\![1, +\infty[\![}$ the sequence of the corresponding times where they exit the retry loop. We notice that by construction, for all $n \in [\![1, +\infty[\![$, $s_n < s_{n+1}$. Let, for $i \in [\![2, S]\!]$ and $n \in [\![1, +\infty[\![$, $(\mathcal{P}_{i,n})$ be the following property: for all $i' \in [\![2, i]\!]$, and for all $n' \in [\![1, n]\!]$, the thread $T_{t_{n'}}$ succeeds in the $i'$-th retry loop at its first attempt.

We assume that for a given $(i, n)$, $(\mathcal{P}_{i+1,n})$ and $(\mathcal{P}_{i,n+1})$ are true, and show that $(\mathcal{P}_{i+1,n+1})$ is true. As the threads $T_{t_n}$ and $T_{t_{n+1}}$ do not have any failure in the first $i$ retry loops, their entrance time in the $(i+1)$-th retry loop is given by

$$s_n + \sum_{i'=2}^{i} \left(rlw_{i'}^{(+)} + pw_{i'}^{(+)}\right) + pw_{i+1}^{(+)} = X_1 \quad \text{and} \quad s_{n+1} + \sum_{i'=2}^{i} \left(rlw_{i'}^{(+)} + pw_{i'}^{(+)}\right) + pw_{i+1}^{(+)} = X_2,$$

respectively. Thread $T_{t_n}$ does not fail in the $(i+1)$-th retry loop, hence exits at

$$X_1 + rlw_{i+1}^{(+)} \leq X_1 + rlw_1^{(+)} = s_n + rlw_1^{(+)} + X_2 - s_{n+1} \leq X_2.$$

As the previous threads $T_{t_{n-1}}, \ldots, T_{t_1}$ exit the retry loop before $T_{t_n}$, and next threads $T_{t_{n'}}$, where $n' > n + 1$, enter this retry loop after $T_{t_{n+1}}$, this implies that the thread $T_{t_{n+1}}$ succeeds in the $(i+1)$-th retry loop at its first attempt, and $(\mathcal{P}_{i+1,n+1})$ is true.

Regarding the first thread that succeeds in the first retry loop, we know that it succeeds in any retry loop since there is no other thread to compete with. Therefore, for all $i \in [\![2, S]\!]$, $(\mathcal{P}_{i,1})$ is true. Then we show by induction that all $(\mathcal{P}_{2,n})$ are true, then all $(\mathcal{P}_{3,n})$, etc., until all $(\mathcal{P}_{S,n})$, which concludes the proof. ■


Theorem 4. The multi-retry loop Procedure Combined is equivalent to the Procedure AbstractAlgorithm, where

$$pw^{(+)} = pw_1^{(+)} + \sum_{i=2}^{S} \left(pw_i^{(+)} + rlw_i^{(+)}\right) \quad \text{and} \quad rlw^{(+)} = rlw_1^{(+)}.$$

Proof: According to Lemma 11, there is no failure in any retry loop other than the first one; therefore, all retry loops have a constant duration, and can thus be considered as parallel sections. ■
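The reduction of Theorem 4, together with the renumbering that puts the longest retry first, amounts to a few lines of arithmetic. The sketch below takes the sizes of the parallel sections and of the retries as two lists in the initial ordering (0-based, whereas the text counts from 1); the function name is illustrative.

```python
def equivalent_single_loop(pw, rlw):
    """Return (pw_eq, rlw_eq) of the equivalent single-retry-loop procedure.

    pw[i]  : size of the parallel section preceding retry loop i
    rlw[i] : size of one retry of retry loop i
    """
    if len(pw) != len(rlw) or not pw:
        raise ValueError("pw and rlw must be non-empty lists of equal length")

    # i0 = min argmax: first index holding the longest retry.
    i0 = max(range(len(rlw)), key=lambda i: rlw[i])

    # New ordering i0, ..., S-1, 0, ..., i0-1 (cyclic shift).
    order = list(range(i0, len(rlw))) + list(range(i0))

    first, others = order[0], order[1:]
    # All other retry loops never fail, so they behave as parallel work.
    pw_eq = pw[first] + sum(pw[i] + rlw[i] for i in others)
    return pw_eq, rlw[first]


if __name__ == "__main__":
    print(equivalent_single_loop(pw=[5, 7, 3], rlw=[2, 9, 4]))
```

The resulting pair can then be fed to the single-retry-loop analysis, the equivalent parallel section simply absorbing every retry loop except the longest.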








3) Expansion: The expansion in the retry loop starts as threads fail inside this retry loop. When
threads are launched, there is no expansion, and Lemma 11 implies that if threads fail, it should be
inside the first retry loop, because it is the longest one. As a result, there will be some stall time
in the memory accesses of this first retry loop, i.e. expansion, and it will get even longer. Failures
will thus still occur in the first retry loop: there is a positive feedback on the expansion of the first
retry loop that keeps this first retry loop as the longest one among all retry loops. Therefore, in
accordance with Theorem 4, we can compute the expansion by considering the equivalent single-retry
loop procedure described in the theorem.

VI. Experimental Evaluation 

We validate our model and analysis framework through successive steps, from synthetic tests, 
capturing a wide range of possible abstract algorithmic designs, to several reference implementations 
of extensively studied lock-free data structure designs that include cases with non-constant parallel 
section and retry loop. 


A. Setting 

We have conducted experiments on an Intel ccNUMA workstation system. The system is composed
of two sockets, each equipped with an Intel Xeon E5-2687W v2 CPU with frequency band 1.2-3.4 GHz.
The physical cores have private L1 and L2 caches and they share an L3 cache, which is 25 MB. In a
socket, the ring interconnect provides L3 cache accesses and core-to-core communication. Due to 
the bi-directionality of the ring interconnect, uncontended latencies for intra-socket communication 
between cores do not show significant variability. 

Our model assumes uniformity in the CAS and Read latencies on the shared cache line. Thus, 
threads are pinned to a single socket to minimize non-uniformity in Read and CAS latencies. In the 
experiments, we vary the number of threads between 4 and 8 since the maximum number of threads
that can be used in the experiments is bounded by the number of physical cores that reside in one
socket.

In all figures, the y-axis provides the throughput, which is the number of successful operations
completed per millisecond. Parallel work is represented on the x-axis in cycles. The graphs contain the
high and low estimates (see Section IV), corresponding to the lower and upper bound on the wasted 
retries, respectively, and an additional curve that shows the average of them. 

As mentioned before, the latencies of CAS and Read are parameters of our model. We used the 
methodology described in [DGT13] to measure latencies of these operations in a benchmark program 
by using two threads that are pinned to the same socket. The aim is to bring the cache line into the 
state used in our model. Our assumption is that the Read is conducted on an invalid line. For CAS, 
the state of the cache line could be exclusive, forward, shared or invalid. Regardless of the state 
of the cache line, CAS requests it for ownership, which compels invalidation in other cores, which in 
turn incurs a two-way communication and a memory fence afterwards to assure atomicity. Thus, the 
latency of CAS does not show significant variability with respect to the state of the cache line, as 
also revealed in our latency benchmarks. 

As for the computation cost, the work inside the parallel section is implemented by a dummy 
for-loop of Pause instructions. 


B. Synthetic Tests 

1) Single retry loop: For the evaluation of our model, we first create synthetic tests that emulate 
different design patterns of lock-free data structures (value of cw) and different application contexts 
(value of pw). As described in the previous subsection, in the Procedure Abstract Algorithm the 
amount of work in both the parallel section and the retry loop is implemented as dummy loops, 
whose costs are adjusted through the number of iterations in the loop. 

Figure 10: Synthetic program 
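As an illustration only, the shape of one such synthetic test can be sketched as below. This is a minimal Python sketch, not the measured benchmark: the names `AtomicInt`, `dummy_work` and `run_synthetic` are hypothetical, the CAS is emulated with a lock, and the iteration counts merely stand in for pw and cw expressed in cycles.

```python
import threading


class AtomicInt:
    """Integer cell with an emulated CAS (a lock stands in for the hardware primitive)."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def read(self):
        with self._lock:
            return self._value

    def cas(self, expected, new):
        # Atomically replace the value only if it still equals `expected`.
        with self._lock:
            if self._value != expected:
                return False
            self._value = new
            return True


def dummy_work(iterations):
    # Dummy loop; its cost is adjusted through the number of iterations.
    for _ in range(iterations):
        pass


def worker(shared, pw_iters, cw_iters, operations, fails_out, index):
    fails = 0
    for _ in range(operations):
        dummy_work(pw_iters)              # parallel section
        while True:                       # retry loop
            current = shared.read()       # Read
            dummy_work(cw_iters)          # critical work
            if shared.cas(current, current + 1):
                break                     # a successful CAS ends the operation
            fails += 1                    # wasted retry
    fails_out[index] = fails


def run_synthetic(threads, pw_iters, cw_iters, operations):
    """Run the test and return (successful operations, failed retries)."""
    shared = AtomicInt(0)
    fails_out = [0] * threads
    pool = [
        threading.Thread(
            target=worker,
            args=(shared, pw_iters, cw_iters, operations, fails_out, i),
        )
        for i in range(threads)
    ]
    for thread in pool:
        thread.start()
    for thread in pool:
        thread.join()
    return shared.read(), sum(fails_out)
```

Every successful CAS increments the shared value by one, so the first returned value equals the number of completed operations; dividing it by the elapsed time in milliseconds would give a throughput in the unit used by the figures.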

Generally speaking, in Figure 10 we observe two main behaviors: when pw is high, the data 
structure is not contended and threads can operate without failure. When pw is low, the data 
structure is contended, and depending on the size of cw (that drives the expansion) a steep decrease 
in throughput or just a roughly constant bound on the performance is observed. 

The position of the experimental curve between the high and low estimates depends on cw. It can 
be observed that the experimental curve mostly tends upwards as cw gets smaller, possibly because 
the serialization of the CASs helps the synchronization of the threads. 

Another interesting fact is the waves appearing on the experimental curve, especially when the 
number of threads is low or the critical work is large. This behavior originates from the variation 
of r as the parallel work changes, a fact that is captured by our analysis. 


Figure 11: Multiple retry loops with 8 threads (legend: Norm. Success, Fails RL1/Success, 
Fails RL2/Success, Total Fails/Success, Low, High, Average) 


2) Several retry loops: We have created experiments by combining several retry loops, each 
operating on an independent variable which is aligned to a cache line. In Figure 11, results are 
compared with the model for the single retry loop case, where the single retry loop is equal to the 
longest retry loop, while the other retry loops are part of the parallel section. The distribution of 
fails in the retry loops is illustrated, and all throughput curves are normalized with a factor of 175 
(to be easily seen in the same graph). Fails per success values are not normalized, and a success is 
obtained after completing all retry loops. 
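The quantities plotted for one data point can be computed as in the following minimal sketch. The function name `plotted_series` is hypothetical, and the sketch assumes that "normalized with a factor of 175" means dividing the throughput by 175; the text does not state the direction of the scaling.

```python
NORMALIZATION_FACTOR = 175  # factor quoted in the text for the throughput curves


def plotted_series(throughput, fails_per_loop, successes):
    """Return the values drawn for one data point of the multi-loop plot.

    throughput      -- successful operations per millisecond
    fails_per_loop  -- failed CAS count of each retry loop (RL1, RL2, ...)
    successes       -- operations that completed all retry loops
    """
    # Assumption: normalization divides the throughput by the factor.
    normalized = throughput / NORMALIZATION_FACTOR
    # Fails per success are reported as they are, without normalization.
    per_loop = [fails / successes for fails in fails_per_loop]
    return {
        "norm_success": normalized,
        "fails_per_success": per_loop,
        "total_fails_per_success": sum(per_loop),
    }
```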


C. Treiber’s Stack 

The lock-free stack by Treiber [Tre86] is one of the most studied efficient data structures. Pop and 
Push both contain a retry loop, such that each retry starts with a Read and ends with a CAS on the 
shared top pointer. In order to validate our model, we start by using Pops. From a stack which is 
initialized with 50 million elements, threads continuously pop elements for a given amount of time. 
We count the total number of pop operations per millisecond. Each Pop first reads the top pointer 
and gets the next pointer of the element to obtain the address of the second element in the stack, 
before attempting to CAS with the address of the second element. The access to the next pointer of 
the first element occurs in between the Read and the CAS. Thus, it represents the work in cw. This 
memory access can possibly introduce a costly cache miss depending on the locality of the popped 
element. 

Figure 12: Pop on Treiber’s stack (panels: cw = 50, threads = 6; cw = 300, threads = 6; 
cw = 600, threads = 6; curves: Low, High, Average, Real) 

To validate our model with different cw values, we make use of this costly cache miss possibility. 
We allocate a contiguous chunk of memory and align each element to a cache line. Then, we initialize 
the stack by pushing elements from contiguous memory with either a single stride or a large stride, 
the latter to disable the prefetcher. When we measure the latency of cw in Pop for the single and 
large stride cases, we obtain values of approximately 50 and 300 cycles, respectively. As a remark, 
300 cycles is the cost of an L3 miss in our system when it is serviced from the local main memory 
module. To create more test cases with larger cw, we extended the stack implementation to pop 
multiple elements with a single operation. Thus, each access to the next element could introduce an 
additional L3 cache miss while popping multiple elements. By doing so, we created cases in which 
each thread pops 2, 3, etc. elements, and cw goes to 600, 900, etc. cycles, respectively. Figure 12 
compares the experimental results from Treiber’s stack with our model. 
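The initialization order and the resulting cw values can be illustrated with a small sketch. The helper names `push_order` and `estimated_cw` are hypothetical, and the way the remaining offsets are covered in later passes is an assumption; the text only states that elements are pushed with a single or a large stride. The default of 300 cycles is the L3 miss cost reported above for the large stride case.

```python
def push_order(num_elements, stride):
    """Indices of a contiguous element array in the order they are pushed.

    stride == 1 walks the array sequentially. A larger stride jumps over
    `stride` cache-line-sized elements each time, and the remaining offsets
    are covered in later passes so that every element is pushed exactly once.
    """
    order = []
    for offset in range(stride):
        order.extend(range(offset, num_elements, stride))
    return order


def estimated_cw(elements_per_pop, miss_cost_cycles=300):
    # One potentially missing `next` access per popped element (large stride case).
    return elements_per_pop * miss_cost_cycles
```

Since elements are popped in the reverse of the push order, two consecutively popped elements are `stride` cache lines apart within a pass, which is what makes the access to the next pointer hard to prefetch.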

As a remark, we did not implement memory reclamation for our experiments but one can 
implement a stack that allows pop and push of multiple elements with small modifications using 
hazard pointers [Mic04]. Pushing can be implemented in the same way as the single element case. A 
Pop requires some modifications for memory reclamation. It can be implemented by making use of 
hazard pointers just by adding the address of the next element to the hazard list before jumping 
to it. Also, the validity of the top pointer should be checked after adding the pointer to the hazard 
list to make sure that other threads are aware of the newly added hazard pointer. By repeating this 
process, a thread can jump through multiple elements and pop all of them with a CAS at the end. 

Algorithm 1: Multiple Pop 

    Pop (multiple) 
        while true do 
            t = Read(top); 
            for multiple do 
                if t = NULL then 
                    return EMPTY; 
                hp* = t; 
                if top != t then 
                    break; 
                hp++; 
                next = t.next; 
            if CAS(&top, t, next) then 
                break; 
        RetireNodes (t, multiple); 
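The idea of popping several elements with a single CAS can be illustrated by the following minimal Python sketch. It is not the implementation used in the experiments: the class names are hypothetical, the CAS is emulated with a lock, hazard pointers are omitted because Python's garbage collector reclaims nodes, and returning EMPTY when fewer than the requested number of elements are available is an assumption of the sketch.

```python
import threading


class Node:
    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node      # never modified after the node is pushed


class AtomicRef:
    """Reference cell with an emulated CAS (a lock stands in for the hardware primitive)."""

    def __init__(self, ref=None):
        self._ref = ref
        self._lock = threading.Lock()

    def read(self):
        with self._lock:
            return self._ref

    def cas(self, expected, new):
        with self._lock:
            if self._ref is not expected:
                return False
            self._ref = new
            return True


class Stack:
    def __init__(self):
        self.top = AtomicRef(None)

    def push(self, value):
        while True:
            old_top = self.top.read()
            if self.top.cas(old_top, Node(value, old_top)):
                return

    def pop_multiple(self, count):
        """Pop `count` elements with one CAS; return None (EMPTY) if fewer are available."""
        while True:
            first = self.top.read()            # Read at the start of the retry
            node, values = first, []
            for _ in range(count):             # critical work: follow the next pointers
                if node is None:
                    return None                # EMPTY
                values.append(node.value)
                node = node.next
            if self.top.cas(first, node):      # single CAS at the end of the retry
                return values
```

Each additional element traversed inside the loop adds one access to a next pointer between the Read and the CAS, which is how popping more elements enlarges cw.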


D. Shared Counter 

In [DLM13], the authors have implemented “scalable statistics counters” relying on the following 
idea: when contention is low, the implementation is a regular concurrent counter with a CAS; when 
the counter starts to be contended, it switches to a statistical implementation, where the counter is 
actually incremented less frequently, but by a higher value. One key point of this algorithm is the 
switch point, which is decided based on the number of failed increments; our model can be used by 
providing the peak point of performance of the regular counter implementation as the switch point. 
We then have implemented a shared counter which is basically a Fetch-and-Increment using a CAS, 
and compared it with our analysis. The result is illustrated in Figure 13 and shows that the parallel 
section size corresponding to the peak point is correctly estimated using our analysis. 
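A Fetch-and-Increment built from a CAS retry loop can be sketched as follows. This is a minimal illustration with a hypothetical class name `CasCounter` and a lock-emulated CAS; it also returns the number of failed attempts, the kind of information a switch decision based on failed increments would rely on. It does not reproduce the statistical counter of [DLM13].

```python
import threading


class CasCounter:
    """Shared counter whose increment is a Fetch-and-Increment built from a CAS."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()   # emulates the atomicity of the hardware CAS

    def read(self):
        with self._lock:
            return self._value

    def _cas(self, expected, new):
        with self._lock:
            if self._value != expected:
                return False
            self._value = new
            return True

    def fetch_and_increment(self):
        """Return (previous value, number of failed CAS attempts)."""
        fails = 0
        while True:
            current = self.read()               # Read
            if self._cas(current, current + 1):
                return current, fails           # success ends the retry loop
            fails += 1                          # another thread won; retry
```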

E. DeleteMin in Priority List 

We have applied our model to DeleteMin of the skiplist based priority queue designed in [LJ13]. 
DeleteMin traverses the list from the beginning of the lowest level, finds the first node that is not 
logically deleted, and tries to delete it by marking. If the operation does not succeed, it continues 
with the next node. Physical removal is done in batches when reaching a threshold on the number of 
deleted prefixes, and is followed by a restructuring of the list by updating the higher level pointers, 
which is conducted by the thread that is successful in redirecting the head to the node deleted by 
itself. 
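The logical deletion step can be illustrated with a simplified sketch. It models only the lowest level as a plain linked list, keeps the deletion mark as a flag on the node itself (a simplification, not necessarily how the cited design stores it), emulates the marking CAS with a lock, and leaves out batched physical removal and restructuring. All names are hypothetical.

```python
import threading


class ListNode:
    def __init__(self, key, next_node=None):
        self.key = key
        self.next = next_node
        self._deleted = False
        self._lock = threading.Lock()   # emulates an atomic CAS on the mark

    def is_deleted(self):
        with self._lock:
            return self._deleted

    def try_mark(self):
        """Emulated CAS(deleted: False -> True); True only for the winning caller."""
        with self._lock:
            if self._deleted:
                return False
            self._deleted = True
            return True


def delete_min(head):
    """Return the smallest key not yet logically deleted, or None if none is left."""
    node = head
    while node is not None:
        if not node.is_deleted() and node.try_mark():
            return node.key          # logical deletion succeeded
        node = node.next             # already deleted or lost the race: move on
    return None


def build_list(keys):
    """Build a sorted singly linked list and return its first node."""
    head = None
    for key in sorted(keys, reverse=True):
        head = ListNode(key, head)
    return head
```

A failed marking attempt does not restart the traversal; the operation simply advances by one link, which matches the description of continuing with the next node.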

We consider the last link traversal before the logical deletion as critical work, as it continues 
with the next node in case of failure. The rest of the traversal is attributed to the parallel section 
as the threads can proceed concurrently without interference. We measured the average cost of a 
traversal under low contention for each number of threads, since traversal becomes expensive with 
more threads. In addition, the average cost of restructuring is also included in the parallel section 
since it is executed infrequently by a single thread. 

Figure 13: Increment on a shared counter (panel: (b) 6 threads; curves: Low, High, Average, Real) 

We initialize the priority queue with a large set of elements. As illustrated in Figure 14, the smallest 
pw value is not zero as the average cost of traversal and restructuring is intrinsically included. The 
peak point is in the estimated place but the curve does not go down sharply under high contention. 
This presumably occurs as the traversal might require more than one step (link access) after a failed 
attempt, which creates a back-off effect. 


F. Enqueue-Dequeue on a Queue 

In order to demonstrate the validity of the model with several retry loops, and that the results 
cover a wider spectrum of applications and designs than the ones we focused on in our model, we 
studied the following setting: the threads share a queue, and each thread enqueues an element, 
executes the parallel section, dequeues an element, and reiterates. We consider the queue 
implementation by Michael and Scott [MS96], which is usually viewed as the reference queue when 
looking at lock-free queue implementations. 

Dequeue operations fit immediately into our model but Enqueue operations need an adjustment due 
to the helping mechanism. Note that without this helping mechanism, a simple queue implementation 
would fit directly, but we also want to show that the model is malleable, i.e. the fundamental behavior 
remains unchanged even if we deviate slightly from the initial assumptions. We consider an equivalent 
execution that catches up with the model, and use it to approximate the performance of the actual 
execution of Enqueue. 

Figure 14: Delete Min on a priority list (panels: cw = 50, threads = 4; cw = 50, threads = 6; 
curves: Low, High, Average, Real) 


Enqueue is composed of two steps. Firstly, the new node is attached to the last node of the queue 
via a CAS, that we denote by CASa, leading to a transient state. Secondly, the tail is redirected to 
point to the new node via another CAS, that we denote by CASb, which brings the queue back into 
a steady state. 

A new Enqueue cannot proceed before the two steps of the previous success are completed. The first 
step is the linearization point of the operation and the second step could be conducted by a different 
thread through the helping mechanism. In order to start a new Enqueue, concurrent Enqueues help 
the completion of the second step of the last success if they find the queue in the transient state. 
Alternatively, they try to attach their node to the queue if the queue is in the steady state at the 
instant of the check. This process continues until they manage to attach their node to the queue via a 
retry loop in which the state is checked and the corresponding CAS is executed. 
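The two steps and the helping mechanism can be illustrated by the following minimal Python sketch of an Enqueue in the style of the Michael and Scott queue. It is an illustration under assumptions, not the benchmarked code: the class names are hypothetical, the CAS is emulated with a lock, and memory reclamation and Dequeue are left out.

```python
import threading


class AtomicRef:
    """Reference cell with an emulated CAS (a lock stands in for the hardware primitive)."""

    def __init__(self, ref=None):
        self._ref = ref
        self._lock = threading.Lock()

    def read(self):
        with self._lock:
            return self._ref

    def cas(self, expected, new):
        with self._lock:
            if self._ref is not expected:
                return False
            self._ref = new
            return True


class QNode:
    def __init__(self, value):
        self.value = value
        self.next = AtomicRef(None)


class Queue:
    def __init__(self):
        dummy = QNode(None)
        self.head = AtomicRef(dummy)
        self.tail = AtomicRef(dummy)

    def enqueue(self, value):
        node = QNode(value)
        while True:
            last = self.tail.read()
            successor = last.next.read()          # state check
            if successor is not None:
                # Transient state: help the pending CASb, then retry.
                self.tail.cas(last, successor)
                continue
            # Steady state: CASa attaches the node (linearization point).
            if last.next.cas(None, node):
                # CASb redirects the tail; it fails if a helper did it first.
                self.tail.cas(last, node)
                return
```

In this sketch a thread that finds the tail lagging behind performs CASb on behalf of the previous successful Enqueue before attempting its own CASa, so an operation can execute several CASb and failing CASa attempts before its final successful CASa.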

The flow of an Enqueue is determined by these state checks. Thus, an Enqueue could execute multiple 
CASb (successful or failing) and multiple CASa (failing) in an interleaved manner, before succeeding 
in CASa at the end of the last retry. If we assume that both states are equally probable for a check 
instant which will then end up with a retry, the CASs that end up with a retry are 
expected to be distributed equally among CASa and CASb for each thread. In addition, each thread 
has a successful CASa (which linearizes the Enqueue) and a CASb at the end of the operation which 
could either be successful or be made to fail by a concurrent helper thread. 

We imitate such an execution with an equivalent execution in which threads keep the same relative 
ordering of the invocations of and returns from Enqueue, together with the same results. In the 
equivalent execution, threads alternate between CASa and CASb in their retries, and both steps of a 
successful operation are conducted by the same thread. The equivalent execution can be obtained by 
thread-wise reordering of the CASs that lead to a retry and by exchanging successful CASbs with 
their failed counterparts at the end of an Enqueue, as the latter ones indeed fail because of the 
success of helper threads. The model can be applied to this equivalent execution by attributing each 
CASa-CASb couple to a single iteration and representing it as a larger retry loop, since a successful 
couple cannot overlap with another successful one and all overlapping ones fail. With a 
straightforward extension of the expansion formula, we accommodate the CASa in the critical work, 
which can also expand, and use CASb as the CAS of our model. 

Figure 15: Enqueue-Dequeue on Michael and Scott queues 

In addition, we take one step further outside the analysis by including a new case, where the 
parallel section follows a Poisson distribution instead of being constant: pw is then used as the mean 
of the Poisson distribution. The results are illustrated in Figure 15. 
Our model provides good estimates for the constant pw and also reasonable results for the Poisson 
distribution case, although this case deviates from (or extends) our model assumptions. The advantage 
of regularity, which brings synchronization to threads, can be observed when the constant and Poisson 
distributions are compared. In the Poisson distribution, the threads start to fail with larger pw, which 
smooths the curve around the peak of the throughput curve. 
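Drawing parallel section sizes from a Poisson distribution with mean pw can be sketched as below. The function names are hypothetical and the sampling method (counting unit-rate exponential arrivals) is one standard-library option chosen for the sketch; the text does not say how the values were generated in the experiments.

```python
import random


def poisson_sample(rng, mean):
    """Draw one Poisson(mean) value by counting unit-rate arrivals in [0, mean)."""
    count = 0
    elapsed = rng.expovariate(1.0)
    while elapsed < mean:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count


def parallel_work_sizes(mean_pw, operations, seed=0):
    """Parallel section sizes for `operations` iterations, Poisson-distributed around mean_pw."""
    rng = random.Random(seed)
    return [poisson_sample(rng, mean_pw) for _ in range(operations)]
```

The cost of one draw grows linearly with the mean, which is acceptable for a sketch but would call for a different sampler if pw were very large.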


G. Discussion 

In this subsection we discuss the adequacy of our model, specifically the cyclic argument, to 
capture the behavior that we observe in practice. Figure 16 illustrates the frequency of occurrence of 
a given number of consecutive fails, together with average fails per success values and the throughput 
values, normalized by a constant factor so that they can be seen on the graph. In the background, 
the frequency of occurrence of a given number of consecutive fails before success is presented. As 
a remark, the frequency of 6+ fails is gathered with 6. We expect to see a frequency distribution 
concentrated around the average fails per success value, within the bounds computed by our model. 

Figure 16: Consecutive Fails Frequency 

While comparing the distribution of failures with the throughput, we could conjecture that the 
bumps come from the fact that the failures spread out. However, our model captures correctly the 
throughput variations and thus strips down the right impacting factor. The spread of the distribution 
of failures indicates the violation of a stable cyclic execution (that takes place in our model), but 
in these regions, r actually gets close to 0, as well as the minimum of all gaps. The scattering in 
failures shows that, during the execution, a thread is overtaken by another one. Still, as gaps are 
close to 0, the imaginary execution, in which we switch the two thread IDs, would create almost the 
same performance effect. This reasoning is strengthened by the fact that the actual average number 
of failures follows the step behavior, predicted by our model. This shows that even when the real 
execution is not cyclic and the distribution of failures is not concentrated, our model that results in 
a cyclic execution remains a close approximation of the actual execution. 
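The frequency distribution discussed in this subsection can be computed from per-operation failure counts as in the following minimal sketch. The function names are hypothetical; the cap of 6 mirrors the remark that 6 and more fails are gathered together, and the input is assumed to be a non-empty list with one entry per successful operation.

```python
from collections import Counter


def consecutive_fail_frequencies(fails_before_success, cap=6):
    """Relative frequency of each number of consecutive fails, with cap+ merged into cap."""
    counts = Counter(min(fails, cap) for fails in fails_before_success)
    total = len(fails_before_success)
    return {k: counts.get(k, 0) / total for k in range(cap + 1)}


def average_fails_per_success(fails_before_success):
    """Mean number of failed retries preceding a success."""
    return sum(fails_before_success) / len(fails_before_success)
```

Comparing the bucket holding most of the mass with the average gives a quick indication of whether the failures are concentrated or spread out for a given pw.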

H. Back-Off Tuning 

Together with the analysis comes a natural back-off strategy: we estimate the pw corresponding to 
the peak point of the average curve, and when the parallel section is smaller than the corresponding 
pw, we add a back-off in the parallel section, so that the new parallel section is at the peak point. 

Figure 17: Comparison of back-off schemes for Poisson Distribution 

We have applied exponential, linear and our back-off strategy to the Enqueue/Dequeue experiment 
specified above. Our back-off estimate provides good results for both types of distribution. In 
Figure 17 (where the values of back-off are steps of 115 cycles), the comparison is plotted for the 
Poisson distribution, which is likely to be the worst for our back-off. Our back-off strategy is better 
than the others, except for very small parallel sections, but the other back-off strategies should be 
tuned for each value of pw. 
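The back-off derived from the analysis amounts to padding the parallel section up to the estimated peak point, which can be sketched as follows. The function names are hypothetical, and the average curve is assumed to be available as sampled (pw, throughput) values produced by the model.

```python
def peak_parallel_work(pw_values, average_throughput):
    """pw at which the model's average throughput curve is highest."""
    best = max(range(len(pw_values)), key=lambda i: average_throughput[i])
    return pw_values[best]


def back_off(pw, peak_pw):
    """Extra delay added to the parallel section so that pw + back-off reaches the peak."""
    return max(0, peak_pw - pw)
```

For example, with a peak estimated at 200 cycles, a parallel section of 50 cycles would be padded by 150 cycles, while a parallel section of 350 cycles would be left unchanged.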

We obtained the same shapes while removing the distribution law and considering constant values. 
The results are illustrated in Figure 18. 

Figure 18: Comparison of back-off schemes for constant pw 


VII. Conclusion 

In this paper, we have modeled and analyzed the performance of a general class of lock-free 
algorithms, and have thus been able to predict the throughput of such algorithms on actual executions. 
The analysis relies on the estimation of two impacting factors that lower the throughput: on the 
one hand, the expansion, due to the serialization of the atomic primitives that take place in the 
retry loops; on the other hand, the wasted retries, due to a non-optimal synchronization between 
the running threads. We have derived methods to calculate those parameters, along with the final 
throughput estimate, which is calculated from a combination of these two parameters. As 
a side result of our work, this accurate prediction enables the design of a back-off technique that 
performs better than other well-known techniques, namely linear and exponential back-offs. 

As future work, we envision enlarging the domain of validity of the model, in order to cope with 
data structures whose operations do not have a constant retry loop, as well as the framework, so that 
it includes more varied access patterns. The fact that our results extend outside the model allows us 
to be optimistic about the identification of the right impacting factors. Finally, we also foresee studying 
back-off techniques that would combine a back-off in the parallel section (for lower contention) and 
in the retry loops (for higher robustness). 